Preprocessor For External Tables

Before External Tables existed in the Oracle database, loading from flat files was done via SQL*Loader. One option some people used was to keep the text file compressed and load it with SQL*Loader via a named pipe. This avoided having to extract the file, which could be several times the size of the compressed version. As of 11.1.0.7, a similar feature is now available for External Tables (and it will also be in 10.2.0.5). This enhancement is the result of Bug 6522622, which is mentioned in the Bugs Fixed in the 11.1.0.7 Patch Set note. Unfortunately it appears that there aren’t any notes on how to actually use the External Table Preprocessor, so allow me to give some insight into its use.

The PREPROCESSOR clause is part of the record_format_info clause. The syntax of the PREPROCESSOR clause is as follows:

PREPROCESSOR [directory_spec:] file_spec [preproc_options_spec]

It’s pretty straightforward when you see an example. The PREPROCESSOR line in the ACCESS PARAMETERS clause below contains the new clause.

create or replace directory load_dir as '/data/tpc-ds/flat_files/1gb';
create or replace directory log_dir  as '/tmp';
create or replace directory exec_dir as '/bin';
--
-- ET_CUSTOMER_ADDRESS
--
DROP TABLE ET_CUSTOMER_ADDRESS;
CREATE TABLE ET_CUSTOMER_ADDRESS
(
    "CA_ADDRESS_SK"                  NUMBER
   ,"CA_ADDRESS_ID"                  CHAR(16)
   ,"CA_STREET_NUMBER"               CHAR(10)
   ,"CA_STREET_NAME"                 VARCHAR2(60)
   ,"CA_STREET_TYPE"                 CHAR(15)
   ,"CA_SUITE_NUMBER"                CHAR(10)
   ,"CA_CITY"                        VARCHAR2(60)
   ,"CA_COUNTY"                      VARCHAR2(30)
   ,"CA_STATE"                       CHAR(2)
   ,"CA_ZIP"                         CHAR(10)
   ,"CA_COUNTRY"                     VARCHAR2(20)
   ,"CA_GMT_OFFSET"                  NUMBER
   ,"CA_LOCATION_TYPE"               CHAR(20)
)
ORGANIZATION EXTERNAL
(
   TYPE oracle_loader
   DEFAULT DIRECTORY load_dir
   ACCESS PARAMETERS
   (
      RECORDS DELIMITED BY NEWLINE
      PREPROCESSOR exec_dir:'gunzip' OPTIONS '-c'
      BADFILE log_dir: 'CUSTOMER_ADDRESS.bad'
      LOGFILE log_dir: 'CUSTOMER_ADDRESS.log'
      FIELDS TERMINATED BY '|'
      MISSING FIELD VALUES ARE NULL
      (
          "CA_ADDRESS_SK"
         ,"CA_ADDRESS_ID"
         ,"CA_STREET_NUMBER"
         ,"CA_STREET_NAME"
         ,"CA_STREET_TYPE"
         ,"CA_SUITE_NUMBER"
         ,"CA_CITY"
         ,"CA_COUNTY"
         ,"CA_STATE"
         ,"CA_ZIP"
         ,"CA_COUNTRY"
         ,"CA_GMT_OFFSET"
         ,"CA_LOCATION_TYPE"
      )
   )
   LOCATION ('customer_address.dat.gz')
)
REJECT LIMIT UNLIMITED
;

SQL> select count(*) from ET_CUSTOMER_ADDRESS;

  COUNT(*)
----------
     50000

Now let’s double check:

$ gunzip -c customer_address.dat.gz | wc -l
50000

Note: The preprocessor option does not allow the use of the |, &, and $ characters, for security reasons.

This is a great enhancement for those who transport compressed files around their networks and want to load them directly into their database via External Tables. One advantage of this feature is that when loading flat files from an NFS staging area, the network traffic is reduced by a factor of N, where N is the compression ratio of the file. For example, if your flat file compresses 10x (which is not uncommon), then you get an effective gain of 10x in throughput for the same network bandwidth. Or, put another way, the network bandwidth required to transfer the same logical data set is reduced 10x. In this case the compression ratio was 4x.

There are a few things to be aware of when using this feature. If the external table is parallel, then the number of files in the External Table LOCATION clause should be equal to or greater than the degree of parallelism (DOP). This is because the preprocessor outputs a stream, and that stream cannot be broken down into granules for multiple Parallel Query Slaves to work on the way a normal uncompressed text file can. Each PQ Slave can work on at most one file/stream. For example, if you have a DOP of 16 set on the External Table but only have 10 files, 10 PQ Slaves will be busy and 6 will be idle, as there are more slaves than files. This means that to get optimal throughput the number of files should be a multiple of the DOP. Obviously this is not always possible, so the recommendation is to have more, smaller files vs. fewer, larger files. This limits the skew in the workload across the PQ Slaves if/when there are “remainder” files.
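
As a rough sketch of that recommendation (the file names and DOP below are purely illustrative assumptions), the table could point at several smaller compressed files and be given a matching degree of parallelism:

-- Illustrative only: four compressed files for a DOP of 4, so each
-- PQ Slave gets its own preprocessor stream to work on.
ALTER TABLE ET_CUSTOMER_ADDRESS
   LOCATION ( 'customer_address_01.dat.gz'
             ,'customer_address_02.dat.gz'
             ,'customer_address_03.dat.gz'
             ,'customer_address_04.dat.gz' );

ALTER TABLE ET_CUSTOMER_ADDRESS PARALLEL 4;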

Hopefully you will find this enhancement very useful. I sure do.

14 comments

  1. Nigel Thomas

    From your description, it looks like this maintains the limitation that the file must be an actual physical file – and I bet they stat it to make sure. So there’s no way you can use this as a cheap and cheerful generic data gateway. I’d like to be able to load from a pipe (created using mknod) which then allows me to pour data into the database from any source – as we do for SQL*Loader.

    In an earlier life, we (at Constellar) used this to build an ETL processing pipeline; data from foreign systems was received (as file or message); piped into a PL/SQL procedure (using UTL_FILE reads) for validation, enrichment and transformation; piped out using UTL_FILE to a Unix pipe and then finally direct loaded from the pipe into the target database (which could be a local or remote Oracle instance – or another foreign database such as DB/2). Using this technique (parallelised of course, as well as pipelined) meant that we minimised the overheads of the transformation and load by avoiding the use of unnecessary temporary tables.

    It would have simplified this process greatly if we could have used external tables (avoiding the UTL_FILE reads and parsing) but because they were based on a pipeline from an external process, that would have been no dice.

    Still, not to worry, this is a useful addition…

  2. Carey

    Nigel,
    I wonder if you could get by with a dummy file to make it happy, and use a script as your “preprocessor” to read the pipe instead.
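
    For illustration, a rough sketch of that idea (all of the names here are hypothetical) might look like the following, where the preprocessor is a one-line script such as "cat /data/feeds/my_pipe" and LOCATION simply points at a small dummy file:

    -- Hypothetical sketch: the dummy file satisfies the physical-file check,
    -- while read_pipe.sh ignores its argument and streams the named pipe.
    CREATE TABLE ET_FROM_PIPE
    (
       "C1"  VARCHAR2(4000)
    )
    ORGANIZATION EXTERNAL
    (
       TYPE oracle_loader
       DEFAULT DIRECTORY load_dir
       ACCESS PARAMETERS
       (
          RECORDS DELIMITED BY NEWLINE
          PREPROCESSOR exec_dir:'read_pipe.sh'
          FIELDS TERMINATED BY '|'
       )
       LOCATION ('dummy.dat')
    )
    REJECT LIMIT UNLIMITED;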

  3. Pingback: Rittman Mead Consulting » Blog Archive » Preprocessing Input Files and 11.1.0.7 External Tables
  4. Stewart Bryson

    Greg:

    Are you constrained by uncompress-type commands, or could the preprocessor call other sorts of command-line entities, such as decrypt commands, etc?

    Nigel:

    If you are loading an Oracle database with data from a remote Oracle database… there’s no need to trouble yourself with all of this. An INSERT /*+ APPEND */ INTO … SELECT * FROM … across a DBLINK will perform similarly. This initiates the direct-path API, very similar to SQL*Loader’s DIRECT mode. However, using the direct-path API from inside SQL gives some benefits that SQL*Loader never had: the MERGE statement, multi-table insert, etc.
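
    For example (the table and database link names here are hypothetical), something along these lines:

    -- Hypothetical names: direct-path insert of remote data over a database link.
    INSERT /*+ APPEND */ INTO customer_address_local
    SELECT *
    FROM   customer_address@remote_db;

    COMMIT;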

    Your point about using this scenario to load from other databases is not lost on me, however: that would be very valuable indeed. That’s what Informatica does to speed up loading into Oracle databases. Since it cannot support the direct-path API from inside SQL, it actually streams data through SQL-Loader in a similar way.

    Regards,

    Stewart

  5. Nigel Thomas

    @Carey – excellent idea – of course we can make a proxy file for any data source (which could contain the real parameters for the pre-processor – like source URL for example)

    @Stewart – yes, the main benefit of streaming is (a) for integration with a non-Oracle source and (b) so that data-in-flight (maybe from SQL Server to DB/2) doesn’t *have* to be physically staged in the (Oracle) transformation hub before it is delivered to its final destination. In the old days we did that with a PL/SQL pipeline using UTL_FILE in and out, then piping into SQL*Loader; nowadays we can have the source treated as an external table and do the transformation as part of the SELECT from that.

  6. Pingback: Log Buffer #124: a Carnival of the Vanities for DBAs
  7. Jens Petersen

    From a developer’s perspective this is a great new feature and I’m just about to write a paper on how to use it to list all the files in a directory.
    But from a security perspective this sounds worryingly like it opens up huge new holes.
    Is there any way to restrict the usage of the preprocessor clause to certain users, or to disable it altogether?

  8. Greg Rahn

    @Jens

    I’m not sure what is so great about listing files in a directory via an external table that it warrants a paper, but whatever…

    Restricting “create directory” or limiting who has execute on the directory (database directory, not filesystem) are options. Most utilities would be in /bin or /usr/local/bin or similar so only root could put files there to be used in the preprocessor.

  9. Bob

    Regarding security, the user running the load needs to be granted EXECUTE permission on “exec_dir” (in this example). One recommendation is that rather than granting execute on exec_dir, which in this example points to /bin, you simply create another directory that holds just the executables (or scripts) that the preprocessor will use.
    EXECUTE is a new permission that can be granted on a directory in 11.1.0.7.
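
    For illustration (the directory path and grantee are hypothetical), that might look like:

    -- Hypothetical example: a dedicated directory holding only the executables
    -- the preprocessor needs, with EXECUTE granted to the loading user.
    CREATE OR REPLACE DIRECTORY preproc_exec_dir AS '/u01/app/oracle/preproc_bin';
    GRANT EXECUTE ON DIRECTORY preproc_exec_dir TO load_user;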

  10. Pingback: New External Table Functionality - PreProcessing and Data Pump (The Data Warehouse Insider)
  11. Pingback: The Core Performance Fundamentals Of Oracle Data Warehousing – Data Loading
  12. Pingback: Neat Oracle Database 11g Release 2 Feature: Preprocessing External Tables » Eddie Awad's Blog
  13. Pingback: External Table Preprocessor Feature in Oracle Database » Eddie Awad's Blog
