ETL: A Simple Package to Load Data from Views

A common, native way to load data into Oracle tables is to create a view and load the data from it. Depending on how the view is built, you can either refresh (i.e. overwrite) the data in a table or append fresh data to it. Here, I present a simple package, ETL, that only requires you to maintain a configuration table and, of course, the source views (or tables) and target tables.

Basic Setup

ETL is for single-source/single-target loads. That means that for each target there is a source view (or table), and each source view can only be used to load data into one target table at a time. Fancier logic, where a single source object is used to feed multiple tables in one go, is not supported by ETL because it’s designed to be simple.

At its heart is the configuration table ETL_CONF, which allows you to specify a (remote) table or view in any schema the user has access to. A trigger, ETL_CONF_BEF_INS, checks before each insert whether the source and target tables exist. It does a few other checks too, just to make sure you’re not inserting garbage. Another trigger, ETL_CONF_AFT_MOD, fires after modifications on ETL_CONF; these modifications are logged in ETL_CONF_LOG, so that you have an audit trail of who changed what and when.

Configurations can be grouped into so-called categories, which are often useful when you want to execute them in batches, or simply to organize them.
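
For illustration, a configuration row might look like the one below. Only TARGET_OWN, TARGET_OBJ, LOAD_CATEGORY, LOAD_ORDER, and the ARCHIVE_COL_* columns are mentioned in this post; the source and load-method column names in this sketch are assumptions, so check the table’s DDL in the repository for the real column list.

INSERT INTO etl_conf
  ( source_own, source_obj    -- assumed column names
  , target_own, target_obj
  , load_method               -- assumed: 'REF' or 'APD', see below
  , load_category, load_order )
VALUES
  ( 'STAGING', 'PATIENTS_V'
  , 'CORE',    'PATIENTS'
  , 'REF'
  , 'MASTER', 10 );

If ETL_CONF_BEF_INS cannot find STAGING.PATIENTS_V or CORE.PATIENTS, the insert is rejected.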

You can run the ETL package with LOAD_TAB_FROM_VIEW (for a single load) or LOAD_ALL_TABS (for batches), the signatures of which are shown below:

PROCEDURE load_tab_from_view
(
  target_own_in   etl_conf.target_own%TYPE
, target_obj_in   etl_conf.target_obj%TYPE
);

PROCEDURE load_all_tabs
(
  category_in        etl_conf.load_category%TYPE := NULL
, resume_load_at_in  etl_conf.load_order%TYPE    := NULL
);

When called without parameters, LOAD_ALL_TABS executes all view-to-table loads in sequence; alternatively, it can load a specific category, either in its entirety or from a certain sequence number onwards (using RESUME_LOAD_AT_IN). The latter is useful when, for instance, an exception cancelled the remainder of a category and you have since fixed the bug but do not want the loads that already succeeded to run from scratch again.
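
For example, suppose the load with sequence number 40 in a (hypothetical) category REPORTING failed and you have fixed the underlying view; you can then pick up where the batch left off:

BEGIN
  -- Re-run the REPORTING category from load order 40 onwards,
  -- skipping the loads that already completed successfully.
  etl.load_all_tabs( category_in       => 'REPORTING'
                   , resume_load_at_in => 40 );
END;
/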

Why does ETL stop on exceptions? Well, the sequence of the loads is specified in LOAD_ORDER, and the assumption is that there may be dependencies among configurations, in the sense that data has to be inserted into a particular table before another load can be executed.

You’ll see entries in the execution log for clean-up procedures. By default, log entries are purged after 100 days, so that the logs do not become unwieldy.

By the way, the code is in my repository, where you’ll find a deployment script in the scripts folder. It installs all packages in the repository, but most of them are needed anyway because ETL builds on the infrastructure that’s already there. The ETL-related tables and views have comments, so that the data dictionary is up to date.

Refreshing versus Appending

When a configuration is set to refresh (‘REF’), all data in the target table is removed and everything from the source table or view is inserted into the emptied table.
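
Conceptually, a refresh boils down to the following. This is a sketch of the mechanics with the hypothetical names from earlier, not the package’s literal code, which may well use TRUNCATE or direct-path inserts instead:

-- Empty the target, then copy everything from the source view.
DELETE FROM core.patients;

INSERT INTO core.patients
SELECT *
FROM   staging.patients_v;

COMMIT;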

Appending data (‘APD’) requires the source table/view to supply only fresh data. In the case of a view, this means having a WHERE clause that selects data that’s not already in the target table, as sketched below. For large tables, indexes are obviously required to make that perform well. For append configurations it is also possible to specify a single-column WHERE clause that is used to purge ‘stale’ data. The idea is that append targets, such as logs, can grow indefinitely, hence the need to sweep out old rows now and then. The columns ARCHIVE_COL_NAME, ARCHIVE_COL_OPER, and ARCHIVE_COL_VALUE define that single-column predicate.
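
To illustrate the fresh-data requirement, here is what an append source view could look like, using hypothetical names (STAGING.EVENT_LOG feeding CORE.EVENT_LOG); the WHERE clause only returns rows newer than the most recent one already in the target:

-- Hypothetical source view for an 'APD' configuration: it exposes
-- only rows that are not yet in CORE.EVENT_LOG. An index on
-- core.event_log(logged_at) keeps the MAX() subquery cheap.
CREATE OR REPLACE VIEW staging.event_log_v
AS
SELECT src.*
FROM   staging.event_log src
WHERE  src.logged_at > ( SELECT NVL(MAX(tgt.logged_at), DATE '1970-01-01')
                         FROM   core.event_log tgt );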

As for the purge predicate, suppose you have a table with a DIAGNOSED_AT column (a DATE or TIMESTAMP) that indicates when a patient was diagnosed, and you’re tasked with automatically purging data older than ten days. You’d set ARCHIVE_COL_NAME to DIAGNOSED_AT, ARCHIVE_COL_OPER to ‘<’, and ARCHIVE_COL_VALUE to SYSDATE - 10.
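
With those three values in place, the purge presumably amounts to the statement below; the table name is made up, but the predicate is exactly the configured one:

-- The single-column predicate assembled from ARCHIVE_COL_NAME,
-- ARCHIVE_COL_OPER, and ARCHIVE_COL_VALUE.
DELETE FROM core.diagnoses
WHERE  diagnosed_at < SYSDATE - 10;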

The reason you are not allowed to specify the WHERE clause yourself, which would have been slightly easier to implement, is that it could lead to improper use of the feature: append configurations ought to have proper indexes on the target table, and a free-style purge predicate would make the built-in feature too easy to abuse. You can of course lift that restriction if you’re not happy with it.

The columns are called ARCHIVE_COL_* because at some point I might want to add logic to move the data to an archive table rather than the rubbish bin; the choice between archiving and purging would obviously be configurable.

Additional Views

There are also three views that have been created for convenience. The first one, ETL_STATS, collects some statistics, so that you have an idea of the run-time performance of each configuration. ETL_HIST combines the information from ETL_STATS and the configuration table ETL_CONF, so it’s your go-to object for both the load order and statistics. ETL_RECENT is the last view that comes with ETL; it gives you the numbers from the last five successful executions, which is especially helpful when the source views have changed and the current execution time differs markedly from the historical one.
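
For instance, a quick way to eyeball recent performance (SELECT * because the exact column list lives in the repository’s DDL; the ORDER BY assumes the view exposes the target owner and object name):

-- Run times of the last five successful executions per configuration.
SELECT *
FROM   etl_recent
ORDER  BY target_own, target_obj;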

I’ve also added a simple view ERROR_LOG_RECENT that shows the errors from the last hour. This is mainly useful when developing.

Scheduler

Because ETL exposes plain procedures, you can easily combine it with Oracle’s built-in scheduler. That way you can periodically load data with minimal effort; after all, you do not need to create scheduler jobs for every single object.

For instance, suppose you have a category for master data and you wish to re-load master data every hour. You can then do the following:

CREATE OR REPLACE PROCEDURE load_master
AS
BEGIN
  -- Run every configuration in the MASTER category in LOAD_ORDER sequence.
  etl.load_all_tabs(category_in => 'MASTER');
END load_master;
/

BEGIN
  dbms_scheduler.create_job( job_name        => 'load_master_tables'
                           , job_type        => 'STORED_PROCEDURE'
                           , job_action      => 'LOAD_MASTER'
                           , repeat_interval => 'FREQ=HOURLY'
                           , enabled         => TRUE
                           , comments        => 'Job to load all master data' );
END;
/

When scheduling jobs, you obviously have to make sure that the previous run has had sufficient time to complete before the next one is attempted.

That’s all, folks. Now go forth and tinker!