The Core Performance Fundamentals Of Oracle Data Warehousing – Partitioning

[back to Introduction]

Partitioning is an essential performance feature for an Oracle data warehouse because partition elimination (or partition pruning) generally removes a significant amount of table data from the scan. The result is lower system resource usage and improved query performance. Someone once told me “the fastest I/O is the one that never happens.” That is precisely why partitioning is a must for Oracle data warehouses – it’s a huge I/O eliminator. I frequently refer to partition elimination as the anti-index: an index is used to find the small amount of data that is required; partitioning is used to eliminate the vast amounts of data that are not required.

Main Uses For Partitioning

I would classify the main reasons to use partitioning in your Oracle data warehouse into these four areas:

  • Data Elimination
  • Partition-Wise Joins
  • Manageability (Partition Exchange Load, Local Indexes, etc.)
  • Information Lifecycle Management (ILM)

Partitioning Basics

The most common partitioning design pattern found in Oracle data warehouses is to partition the fact tables by range (or interval) on the event date/time column. This allows queries to eliminate every partition outside the desired time window. For example: if I have a fact table that contains point-of-sale (POS) data, each line item for a given transaction has a time stamp of when the item was scanned. Let’s say this value is stored in column EVENT_TS, which is a DATE or TIMESTAMP data type. In most cases it makes sense to partition by range on EVENT_TS using one-day partitions. This means every query that uses a predicate filter on EVENT_TS (which should be nearly every one) can eliminate significant amounts of data that are not required to satisfy the query predicate. If you want to look at yesterday’s sales numbers, there is no need to bring back rows from last week or last month!
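A minimal sketch of this pattern (table and column names are illustrative, not from the post; the INTERVAL clause assumes 11g or later, and interval partitioning requires a DATE or NUMBER partitioning key):

```sql
-- One-day range (interval) partitions on the event date/time column.
CREATE TABLE pos_fact (
  event_ts  DATE   NOT NULL,   -- when the item was scanned
  store_id  NUMBER,
  item_id   NUMBER,
  amount    NUMBER
)
PARTITION BY RANGE (event_ts)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
( PARTITION p_initial VALUES LESS THAN (DATE '2010-01-01') );

-- A predicate on EVENT_TS prunes the scan to a single day's partition:
SELECT SUM(amount)
FROM   pos_fact
WHERE  event_ts >= DATE '2010-03-01'
AND    event_ts <  DATE '2010-03-02';
```

With a range predicate on EVENT_TS, the optimizer reads only the partitions that can contain matching rows; yesterday’s query touches one partition regardless of how much history the table holds.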

Subpartitioning Options

Depending on the schema design of your data warehouse you may also choose to subpartition a table. This further segments the table, enabling even more data elimination, or it can enable partition-wise joins, which reduce CPU and memory usage by minimizing the data exchanged between parallel execution server processes. In third normal form (3NF) schemas it is very beneficial to use hash partitioning or subpartitioning to allow for partition-wise joins (see Oracle Parallel Execution: Interconnect Myths And Misunderstandings) for exactly this reason. Dimensional models (star schemas) may also benefit from hash subpartitioning and partition-wise joins. Generally it is best to hash subpartition on the join key column to a very large dimension, like CUSTOMER, so that a partition-wise join is used between the fact table and the large dimension table.
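A hedged sketch of this composite design (table names are illustrative; for a full partition-wise join the fact table’s hash subpartitioning must use the same join key column and the same number of hash buckets as the dimension’s hash partitioning):

```sql
-- Very large dimension, hash partitioned on its join key.
CREATE TABLE customer_dim (
  customer_id  NUMBER PRIMARY KEY,
  customer_nm  VARCHAR2(100)
)
PARTITION BY HASH (customer_id) PARTITIONS 16;

-- Fact table: range on event time for pruning, hash subpartitions
-- on the join key so fact/dimension joins can be partition-wise.
CREATE TABLE sales_fact (
  event_ts     DATE NOT NULL,
  customer_id  NUMBER,
  amount       NUMBER
)
PARTITION BY RANGE (event_ts)
SUBPARTITION BY HASH (customer_id) SUBPARTITIONS 16
( PARTITION p201001 VALUES LESS THAN (DATE '2010-02-01'),
  PARTITION p201002 VALUES LESS THAN (DATE '2010-03-01') );
```

The matching hash bucket counts (a power of two on both sides) are what let each pair of parallel execution servers join one fact subpartition to one dimension partition without redistributing rows.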

Manageability

Managing large objects can be challenging for a variety of reasons, which is why Oracle Partitioning allows many operations to be done at the global, partition, or subpartition level. This makes it much easier to deal with very large tables and indexes. It is also transparent to applications, so SQL that runs against a non-partitioned object runs as-is against a partitioned object. Some of the key features include:

  • Partition Exchange Load – Data can be loaded “out of line” and exchanged into a partitioned table.
  • Local Indexes – It takes much less time to build local indexes than global indexes.
  • Compression – Can be applied at the segment level so it’s possible to have a mix of compressed and non-compressed partitions.
  • Segment Moves/Rebuilds/Truncates/Drops – Each partition (or subpartition) is a segment and can be operated on individually and independently of the other partitions in the table.
  • Information Lifecycle Management (ILM) – Partitioning allows implementation of an ILM strategy.
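The partition exchange load from the first bullet can be sketched as follows (names are illustrative, assuming a simply range-partitioned fact table): data is loaded “out of line” into a standalone staging table of identical shape, then published with a metadata-only swap.

```sql
-- Staging table with the same shape as the fact table.
CREATE TABLE sales_stage AS
SELECT * FROM sales_fact WHERE 1 = 0;

-- Direct-path load into the staging table, away from the live table
-- (ext_sales_load is a hypothetical external/source table).
INSERT /*+ APPEND */ INTO sales_stage
SELECT event_ts, customer_id, amount FROM ext_sales_load;

-- Publish the data with a near-instant data dictionary swap.
ALTER TABLE sales_fact
  EXCHANGE PARTITION p201003 WITH TABLE sales_stage;
```

The exchange swaps segment pointers rather than moving rows, so queries against the fact table see the new partition’s data appear essentially instantly.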

Summary

I’d classify partitioning as a “must have” for Oracle data warehouses for both the performance and manageability reasons described above. Partitioning should lower both query response time and resource utilization due to “smart” data access (only go after the data the query needs). There are additional partitioning design patterns as well, and the Oracle documentation describes them with examples.

Oracle Documentation References:

Comments

  1. Fahd Mirza

    Greg, would you please look at the following point and comment:

    Almost all the time, fact tables in data warehouses should be partitioned. If the data is loaded with respect to time then the partition key should be the date column. All the foreign keys in the fact table should have bitmap indexes on them. Local prefixed indexes should be created on the partitions, having the partition key at the leading edge, and these indexes should be B-tree. Stats should be gathered after every data load, with estimate_percent set to DBMS_STATS.AUTO_SAMPLE_SIZE.

    Thanks and regards

  2. Greg Rahn

    @Fahd Mirza
    It would help to provide a bit more context as to what exactly you are asking or trying to achieve, but here are my answers/guesses:

    Q: Should tables loaded by time have the time key as the partition key?
    A: The partition key should be the column that represents when the event happened and is frequently used in the predicate of queries so partition pruning/elimination can be leveraged. E.g. How much was sold yesterday (or last Tuesday) can prune to 1 day or 1 month, etc.

    Q: Should the fact table FK columns have bitmap indexes on them?
    A: If your design is to leverage star transformations, then yes. Read more.

    Q: Should indexes be local prefixed b-tree?
    A: Possibly. Indexes should have a design that solves a problem. E.g. Use what design is necessary to get the execution plan you desire, but if you are using star transformation, then it is unlikely you will have any other indexes other than the FK bitmaps.

    Q: Should stats be gathered after every load with DBMS_STATS.AUTO_SAMPLE_SIZE?
    A: Stats should be gathered where there are none or they are stale. If on 11g, then yes, else generally yes, but there may be exceptions (read more).

  3. Fahd Mirza

    @Greg, thanks for the answer. You certainly cleared many things.

    My question is that as in a dimensional model, there are certain dimension tables and then there is a fact table. In the fact table, there are foreign keys from the dimensions and then there are fact columns:

    e.g.

    SQL> DESC Customer_Fact

    Customer_Fact_pk ——–> Surrogate Key
    Customer_Transaction_Date_fk ——–> Foreign Key from the Time dimension
    Customer_Region_fk ———–> Foreign Key from the Region dimension
    Amount—–> additive fact

    Now please correct/validate me on the following point:

    we should create bitmap indexes on Customer_Transaction_Date_fk and Customer_Region_fk columns.

    What sort of index should we create on Customer_Fact_pk and Amount column?

    Should we range partition this fact table on Customer_Transaction_Date_fk? But as this column contains number and not a date, so should we include the date column also in the fact table and partition the table on that date column?

    Thanks and best regards

    Fahd

  4. Greg Rahn

    @Fahd Mirza

    To quote the link from the Oracle Documentation from my previous response:

    A bitmap index should be built on each of the foreign key columns of the fact table or tables

    so, if you desire star transformation, then both FK columns on the fact table require bitmap indexes.

    The date column can be represented by a number which is the key to the time/date dimension so it is ok to use this for the partitioning key. Oracle can use subquery pruning or bloom filters to first get the list of values from the time/date dimension and use that to get partition pruning.
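    A hedged illustration of that pattern (table and column names are hypothetical): the filter is placed on the date dimension, and Oracle can use subquery pruning or a bloom filter on the resulting join-key values to prune fact partitions even though the predicate is not on the partition key itself.

```sql
-- Fact partitioned on the numeric date key; the predicate is on the dimension.
SELECT SUM(f.amount)
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.date_key
WHERE  d.calendar_date BETWEEN DATE '2010-01-01' AND DATE '2010-01-31';
```

    The optimizer first resolves which DATE_KEY values fall in January, then visits only the fact partitions that can contain those keys.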

    The indexes you create should be there to solve a problem. You have not stated what you are trying to solve, so I really cannot make any index recommendations. It’s like asking “should I go to the doctor?” The answer depends on whether you are sick or not, and you have not provided enough detail to determine this.

  5. Paul James

    So, in a data warehouse with a transactional fact table with a date dimension (where the surrogate key for the dimension is populated by a sequence) how would you approach partitioning by month?

    Hard-code surrogate keys for list partitioning, just partition by the day surrogate key (and ignore the month), or something else?

    Thanks,
    Paul

  6. Greg Rahn

    @Paul James

    For 1 month partitions you can use range partitioning and use the date key for first day of the next month as the “values less than” value.

    For example:
    create table fact ( ... )
    partition by range (date_key) (
      partition p200912 values less than ({date_key for 2010-01-01}),
      partition p201001 values less than ({date_key for 2010-02-01}),
      partition p201002 values less than ({date_key for 2010-03-01}),
      ... )

  7. Paul James

    @Greg Rahn

    The surrogate keys are just numbers, so 2010-01-01 could be 245 and 2010-02-01 : 263 etc.

    And because they are generated from a sequence there is no guarantee that the key for a date has any relation to its date’s ordering.

    So I’m not sure your example applies in this scenario.

  8. Greg Rahn

    @Paul James

    As long as the keys are in sorted order from month to month (they can be out of order within a month, but not across month boundaries) it will work (though you should modify the date dimension data population to guarantee this — it’s a very good design practice). If this is not the case, you would be best served to redesign it so that it is, as the gains from partition elimination will likely be quite significant.

    I’d recommend reading up on these Kimball Design Tips:

    Kimball Design Tip #2: Multiple Time Stamps
    Kimball Design Tip #5: Surrogate Keys for the Time Dimension
    Kimball Design Tip #51: Latest Thinking on Time Dimension Table
    Kimball Design Tip #85: Smart Date Keys to Partition Fact Tables

  10. Eric

    Totally agree, partitioning is very important for ORACLE data warehouse design.

    Hi Greg, can you comment on the following scenario please?

    The Product_Dim (SCD-1) has a composite PK (COMPANY_ID, ITEM_ID): COMPANY_ID is 3-digit; ITEM_ID is 15-digit;

    There are more than 500M unique PK values in this dimension. Within all the COMPANY_IDs, there are 10 IDs which have 80% of the items associated with them. So the dimension table is RANGE-HASH partitioned on (COMPANY_ID, ITEM_ID) so that these 10 big COMPANY_IDs have their own partitions (such as P_101, P104, P105, P128, P133, P201, etc.).

    The Fact table has COMPANY_ID + ITEM_ID as disabled FK constraints. The Fact table is range partitioned on (region_id, transaction_local_ts). The region_id has 5 values so far, representing 5 geo-timezone regions as well as 5 ETL batches every day. In total, about 70M–100M rows are loaded into the Fact table daily.

    Now, if we use subpartition by hash(ITEM_ID), the partition-wise join between Fact and Product_Dim can’t happen. If we use subpartition by hash(ITEM_ID, COMPANY_ID), the partition-wise join between Fact and Product_Dim can’t happen either. What will be an optimal way to partition both tables?

    Thanks

  11. Eric

    “There are more than 500M unique PK in this dimension” means “There are more than 500M unique PK values in this dimension”

  12. Greg Rahn

    @Eric

    “The Product_Dim (SCD-1) has a composite PK (COMPANY_ID, ITEM_ID)”

    I think your problem starts there. I’m quite sure that dimensions are supposed to have a single column surrogate primary key. It seems to me like your product_dim table has both products and companies in it (500M unique products does seem rather large to me.)

    Partitioning has a purpose, be it for queries or for loads; what is the purpose of your partitioning?

  13. ivin kris

    If an Oracle B-tree index is balanced, why would a table with no partitions need to scan more rows if the time column is indexed and the query filters on it? And if Oracle uses multi-version read consistency, why does a large data load bother concurrency? Why can I not load large data directly into the table instead of loading into a temp table and swapping the partition?

  14. Greg Rahn

    If an Oracle B-tree index is balanced, why would a table with no partitions need to scan more rows if the time column is indexed and the query filters on it?
    First, balance has nothing to do with which rows are retrieved from a table. Partitioning allows multi-block scans, whereas index access translates to single-block access to the table. For a large number of rows, partitioning will prevail when it comes to performance.

    And if Oracle uses multi-version read consistency, why does a large data load bother concurrency?
    Bother the concurrency of what?

    Why can I not load large data directly into the table instead of loading into a temp table and swapping the partition?
    You can certainly load directly into a table. Both are options depending on the requirements.

  15. ivin kris

    1) I meant that if the B-tree is balanced and the depth is always 3, the number of I/Os needed to select data from a table without partitions vs. with partitions should not differ, shouldn’t it? Why is it said that partitions eliminate the need to scan largely unwanted data? Could you please explain.
    2) If Oracle uses multi-version read consistency, why does a large data load have contention with other transactions reading data from the table? It shouldn’t, should it?
    3) Depending on the requirement? My requirement is that I am having to update a large amount of data in the table while users are running queries against it. Do I need a partition swap in this scenario was my question.

  16. ivin kris

    Assuming I have a range partition on the date column and all the users are running queries for the same date (the current business date), when I have large updates that need to be processed for the data in this partition:

    1) How does a partition swap help with large-load manageability by reducing contention with user read activity? Isn’t multi-version read consistency already addressing the contention?
    2) If there were no partitions, why would the number of rows to be scanned be any larger if all the queries need the data for the same date (the current business date; the business date column is indexed)?

  17. Greg Rahn

    1) I meant that if the B-tree is balanced and the depth is always 3, the number of I/Os needed to select data from a table without partitions vs. with partitions should not differ, shouldn’t it? Why is it said that partitions eliminate the need to scan largely unwanted data? Could you please explain.

    Again, index balance and depth have nothing to do with the number of I/Os to the table. It’s also irrelevant to the point being made here. Partitioning allows the identification of data that is not needed. I’d suggest you read the Partition Pruning section in the Partitioning Concepts chapter of the VLDB and Partitioning Guide — it explains it all.

    2) If Oracle uses multi-version read consistency, why does a large data load have contention with other transactions reading data from the table? It shouldn’t, should it?

    It doesn’t. Not sure where you are reading that in my post.

    3) Depending on the requirement? My requirement is that I am having to update a large amount of data in the table while users are running queries against it. Do I need a partition swap in this scenario was my question.

    Bulk updates in general should be avoided. Bulk updates 1) generate redo logging and 2) are slowed down by compression. The fastest way to do a bulk update is a nologging insert, changing the data in flight, followed by a partition exchange or table rename, given an appropriately sized system.
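    A hedged sketch of that approach (table names and the change itself are illustrative, assuming a simply range-partitioned fact table): rewrite the partition’s data with a direct-path insert that applies the change in the SELECT, then exchange the result in.

```sql
-- Build the updated copy of the partition with a nologging
-- direct-path CTAS, applying the change "in flight" in the SELECT.
CREATE TABLE sales_fix NOLOGGING AS
SELECT event_ts,
       customer_id,
       amount * 1.05 AS amount   -- the in-flight change (illustrative)
FROM   sales_fact PARTITION (p201003);

-- Swap the rewritten data in place of the old partition.
ALTER TABLE sales_fact
  EXCHANGE PARTITION p201003 WITH TABLE sales_fix;
```

    Readers querying the table during the rewrite continue to see the old partition; the exchange at the end is a quick dictionary operation rather than a long row-by-row update.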

  18. ivin kris

    I read this,
    “Partition pruning is the simplest and also the most substantial means to improve performance using partitioning. Partition pruning can often improve query performance by several orders of magnitude. For example, suppose an application contains an Orders table containing a historical record of orders, and that this table has been partitioned by week. A query requesting orders for a single week would only access a single partition of the Orders table. If the Orders table had 2 years of historical data, then this query would access one partition instead of 104 partitions. This query could potentially execute 100 times faster simply because of partition pruning.”

    That is where I am not sure how. If the business date column is indexed, in a table of 10 million rows with 1 million in the current business date, how do partitions limit the number of rows to be scanned? It doesn’t tell what it internally means to say that queries execute faster against a single small partition than when scanning multiple partitions. I am an application programmer and not a DBA. Is it possible to give a little more in-depth explanation of the overhead associated with multi-partition scanning and the performance issues related to it? And I am not even talking about multi-partition scans here. What if the table isn’t partitioned at all? Will the query still perform slower?

  19. Greg Rahn

    Understand that indexes are structures for accessing data sets that are very selective (few rows). If a query needs 1/10 of the data in the table (1M of 10M rows, in this case), that really is not selective enough for an index. Besides, in the worst case (one wanted row per block), that would translate into 1M single-block reads. With a partition scan, the I/O size could be up to 1MB. So you see the difference here? If you have to access a significant number of blocks, would you rather do that one 8KB block at a time, or 1MB (128 8KB blocks) at a time? It simply comes down to the physics of data access and the times associated with it.

