[back to Introduction]
Getting flat file data into your Oracle data warehouse is likely a daily (or more possibly frequent) task, but it certainly does not have to be a difficult one. Bulk loading data rates are governed by the following operations and hardware resources:
- How fast can the data be read
- How fast can data be written out
- How much CPU power is available
I’m always a bit amazed (and depressed) when I hear people complain that their data loading rates are slow and they proceed to tell me things like:
- The source files reside on a shared NFS filer (or similar) and it has just a single GbE (1 Gigabit Ethernet) network path to the Oracle database host(s).
- The source files reside on this internal disk volume which consists of a two disk mirror (or a volume with very few spindles).
Maybe it’s not entirely obvious so let me spell it out (as I did in this tweet):
One can not load data into a database faster than it can be delivered from the source. Database systems must obey the laws of physics!
Or putting it another way: Don’t fall victim to slow data loading because of a slow performing data source.
Given a system that can provide data at fast enough rates, the loading rate becomes a factor of point #2 and #3. The database operations can be simplified to:
- Read lines from flat files
- Process lines into columns, internal data types and optionally compress
- Write rows/columns out to disk
In most cases a reasonably size system becomes CPU bound, not write bound, on data loads as almost all Oracle data warehouses use compression which increases CPU consumption but reduces the IO requirement for the writes. Or putting it another way: Bulk loading into a compressed table should be a CPU bound operation, not a disk (write) bound operation.
Data Loading Best Practices (What To Do and Why To Do It)
Oracle offers two methods to load data from flat files: 1) SQL*Loader and 2) External Tables. I would highly recommend that bulk loads (especially PDML loads) be done via External Tables and SQL*Loader only be used for non-parallel loads (PARALLEL=false) with small amounts of data (not bulk loads). The high level reason for this recommendation is that External Tables have nearly all the SQL functionality of a heap table and allow numerous more optimizations than SQL*Loader does and there are some undsireable side effects (mostly in the space management layer) from using PARALLEL=true with SQL*Loader.
In order to avoid the reading of the flat files being the bottleneck, use a filesystem that is backed by numerous spindles (more than enough to provide the desired loading rate) and consider using compressed files in conjunction with the external table preprocessor. Using the preprocessor is especially useful if there are proportionally more CPU resources on the database system than network or disk bandwidth because the use of compression on the source files allows for a larger logical row “delivery” for a given file size. Something that may not be obvious either is to put the flat files on a filesystem that is mounted using directio mount options. This will eliminate the file system cache being flooded with data that will (likely) never be read again (how many times do you load the same files?). Another option that becomes available with Oracle 11.2 is DBFS (database filesystem) and is what is frequently used with the Oracle Database Machines & Exadata which is a fast and scalable solution for staging flat files.
In order to achieve the best loading speeds be sure to:
- Use External Tables
- Use a staging filesystem (and network) fast enough to meet your loading speed requirements (and consider directio mount options)
- Use Parallel Execution (parallel CTAS or PDML INSERT)
- Use direct-path loads (nologging CTAS or INSERT /*+ APPEND */)
- Use a large enough initial/next extent size (8MB is usually enough)
If you follow these basic recommendations you should be able to achieve loading speeds that easily meet your requirements (otherwise you likely just need more hardware).
Loading Data At Ludicrous Speed
I’ve yet to come across a reasonably designed system that is capable of becoming write bound as systems simply either 1) do not have enough CPU to do so or 2) are unable to read the source flat files anywhere near fast enough to do so. I have, however, conducted experiments to test write throughput of a Sun Oracle Database Machine (Exadata V2) by using flat files cached completely in the filesystem cache and referencing them numerous times in the External Table DDL. The results should be quite eye opening for many, especially those who think the Oracle database can not load data fast. Loading into an uncompressed table, I was able load just over 1TB of flat file data (over 7.8 billion rows) in a mear 4.6 minutes (275 seconds). This experiment does not represent typical loading speed rates as it’s unlikely the source files are on a filesystem as fast as main memory, but it does demonstrate that if the flat file data could be delivered at such rates, the Oracle software and hardware can easily load it at close to physics speed (the max speed the hardware is capable of).
SQL> create table fastload 2 pctfree 0 3 parallel 4 nologging 5 nocompress 6 storage(initial 8m next 8m) 7 tablespace ts_smallfile 8 as 9 select * from et_fastload; Table created. Elapsed: 00:04:35.49 SQL> select count(*) from fastload; COUNT(*) ------------- 7,874,466,950 Elapsed: 00:01:06.54 SQL> select ceil(sum(bytes)/1024/1024) mb from user_segments where segment_name='FASTLOAD'; MB --------- 1,058,750 SQL> exec dbms_stats.gather_table_stats(user,'FASTLOAD'); PL/SQL procedure successfully completed. SQL> select num_rows,blocks,avg_row_len from user_tables where table_name='FASTLOAD'; NUM_ROWS BLOCKS AVG_ROW_LEN ---------- ---------- ----------- 7874466950 135520008 133
Just so you don’t think I’m making this stuff up, check out the SQL Monitor Report for the execution, noting the IO throughput graph from the Metrics tab (10GB/s write throughput isn’t half bad).
So as you can see, flat file data loading has really become more of a data delivery problem rather than a data loading problem. If the Oracle Database, specifically the Exadata powered Oracle Database Machine, can bulk load data from an external table whose files reside in the filesystem cache at a rate of 13TB per hour (give or take), you probably don’t have to worry too much about meeting your data loading rate business requirements (wink).
Note: Loading rates will vary slightly depending on table definition, number of columns, data types, compression type, etc.
Before External Tables existed in the Oracle database, loading from flat files was done via SQL*Loader. One option that some used was to have a compressed text file and load it with SQL*Loader via a named pipe. This allowed one not to have to extract the file, which could be several times the size of the compressed file. As of 18.104.22.168, a similar feature is now available for External Tables (and will be in 10.2.0.5). This enhancement is a result of Bug 6522622 which is mentioned in the Bugs fixed in the 22.214.171.124 Patch Set note. Unfortunately it appears that there aren’t any notes on how to actually use the External Table Preprocessor so allow me to give some insight into its use.
PREPROCESSOR clause is part of the
record_format_info clause. The syntax of the
PREPROCESSOR clause is as follows:
PREPROCESSOR [directory_spec:] file_spec [preproc_options_spec]
It’s pretty straight forward when you see an example. Line 31 contains the new clause.
create or replace directory load_dir as '/data/tpc-ds/flat_files/1gb'; create or replace directory log_dir as '/tmp'; create or replace directory exec_dir as '/bin'; -- -- ET_CUSTOMER_ADDRESS -- DROP TABLE ET_CUSTOMER_ADDRESS; CREATE TABLE ET_CUSTOMER_ADDRESS ( "CA_ADDRESS_SK" NUMBER ,"CA_ADDRESS_ID" CHAR(16) ,"CA_STREET_NUMBER" CHAR(10) ,"CA_STREET_NAME" VARCHAR2(60) ,"CA_STREET_TYPE" CHAR(15) ,"CA_SUITE_NUMBER" CHAR(10) ,"CA_CITY" VARCHAR2(60) ,"CA_COUNTY" VARCHAR2(30) ,"CA_STATE" CHAR(2) ,"CA_ZIP" CHAR(10) ,"CA_COUNTRY" VARCHAR2(20) ,"CA_GMT_OFFSET" NUMBER ,"CA_LOCATION_TYPE" CHAR(20) ) ORGANIZATION EXTERNAL ( TYPE oracle_loader DEFAULT DIRECTORY load_dir ACCESS PARAMETERS ( RECORDS DELIMITED BY NEWLINE PREPROCESSOR exec_dir:'gunzip' OPTIONS '-c' BADFILE log_dir: 'CUSTOMER_ADDRESS.bad' LOGFILE log_dir: 'CUSTOMER_ADDRESS.log' FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ( "CA_ADDRESS_SK" ,"CA_ADDRESS_ID" ,"CA_STREET_NUMBER" ,"CA_STREET_NAME" ,"CA_STREET_TYPE" ,"CA_SUITE_NUMBER" ,"CA_CITY" ,"CA_COUNTY" ,"CA_STATE" ,"CA_ZIP" ,"CA_COUNTRY" ,"CA_GMT_OFFSET" ,"CA_LOCATION_TYPE" ) ) LOCATION ('customer_address.dat.gz') ) REJECT LIMIT UNLIMITED ; SQL> select count(*) from ET_CUSTOMER_ADDRESS; COUNT(*) ---------- 50000
Now let’s double check:
$ gunzip -c customer_address.dat.gz | wc -l 50000
Note: The preprocessor option does not allow the use│, , &, and $ characters due to security reasons.
This is a great enhancement for those who transport compressed files around their networks and want to load them directly into their database via External Tables. One advantage of this feature is that when loading flat files from an NFS staging area, the network traffic is reduced by N, where N is the compression ratio of the file. For example, if your flat file compresses 10x (which is not uncommon), then you get an effective gain of 10x the throughput for the same network bandwidth. Or if you like, the required network bandwidth is reduced 10x to transfer the same logical data set. In this case the compression rate was 4x.
There are a few things to be aware of when using this feature. If the external table is parallel, then the number of files in the External Table Location clause should be equal or greater than the degree of parallelism (DOP). This is because the preprocessor outputs a stream and this stream can not be broken down into granules for multiple Parallel Query Slaves to work on, like a normal uncompressed text file. Each PQ Slave can work on at most, 1 file/stream. For example, if you have a DOP of 16 set on the External Table, but only have 10 files, 10 PQ Slaves will be busy and 6 will be idle, as there are more slaves than files. This means that to get optimal throughput the number of files should be a multiple of the DOP. Obviously this is not always possible so the recommendation is to have more smaller files vs. fewer larger files. This will limit the skew in the workload for the PQ Slaves if/when there are “remainder” files.
Hopefully you will find this enhancement very useful. I sure do.