Category: Data Warehousing

Real-World Performance Videos on YouTube – Data Warehousing

Here are some videos of a data warehouse demo that the Real-World Performance Group has been running for a while now and we thought it was time to put them on YouTube. Hope you find them informative.

Migrate a 1TB Data warehouse in 20 Minutes (Part 1)

Migrate a 1TB Data warehouse in 20 Minutes (Part 2)

Migrate a 1TB Data warehouse in 20 Minutes (Part 3)

Migrate a 1TB Data warehouse in 20 Minutes (Part 4)

Reading Parallel Execution Plans With Bloom Pruning And Composite Partitioning

You’ve probably heard sayings like “sometimes things aren’t always what they seem” and “people lie”. Well, sometimes execution plans lie. It’s not really by intent, but it is sometimes difficult (or impossible) to represent everything in a query execution tree in the nice tabular format that dbms_xplan gives.

One of the optimizations that was introduced back in 10gR2 was the use of bloom filters. Bloom filters can be used in two ways: 1) for filtering or 2) for partition pruning (bloom pruning) starting with 11g. Frequently the data models used in data warehousing are dimensional models (star or snowflake) and most Oracle warehouses use simple range (or interval) partitioning on the fact table date key column as that is the filter that yields the largest I/O reduction from partition pruning (most queries in a time series star schema include a time window, right!). As a result, it is imperative that the join between the date dimension and the fact table results in partition pruning.

Let’s consider a basic two table join between a date dimension and a fact table. For these examples I’m using STORE_SALES and DATE_DIM which are TPC-DS tables (I frequently use TPC-DS for experiments as it uses a dimensional (star) model and has a data generator.) STORE_SALES contains a 5 year window of data ranging from 1998-01-02 to 2003-01-02.

Range Partitioned STORE_SALES

For this example I used range partitioning on STORE_SALES.SS_SOLD_DATE_SK using 60 one-month partitions (plus 1 partition for NULL SS_SOLD_DATE_SK values) that align with the date dimension (DATE_DIM) on calendar month boundaries. STORE_SALES has the parallel attribute (PARALLEL 16 in this case) set on the table to enable Oracle’s Parallel Execution (PX).
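As a point of reference, here is a minimal sketch of what such a partitioning scheme might look like (the column list is abbreviated to the columns used in the test query, and the partition names and boundary key values are illustrative rather than the actual TPC-DS DDL):

create table store_sales (
  ss_sold_date_sk   number,
  ss_item_sk        number,
  ss_sales_price    number(7,2)
)
parallel 16
partition by range (ss_sold_date_sk)
(
  partition p_199801 values less than (2450846),  -- illustrative surrogate key boundary
  -- ... one partition per calendar month ...
  partition p_200212 values less than (2452641),
  partition p_max    values less than (maxvalue)  -- rows with NULL ss_sold_date_sk land here
);

Let’s look at the execution time and plan for our test query: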

SQL> select
  2    max(ss_sales_price)
  3  from
  4    store_sales ss,
  5    date_dim d
  6  where
  7    ss_sold_date_sk = d_date_sk and
  8    d_year = 2000
  9  ;

MAX(SS_SALES_PRICE)
-------------------
                200

Elapsed: 00:00:41.67

SQL> select * from table(dbms_xplan.display_cursor(format=>'basic +parallel +partition +predicate'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select   max(ss_sales_price) from   store_sales ss,   date_dim d where
 ss_sold_date_sk=d_date_sk and   d_year = 2000

Plan hash value: 934332680

---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name         | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |        |      |            |
|   1 |  SORT AGGREGATE               |             |       |       |        |      |            |
|   2 |   PX COORDINATOR              |             |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10001    |       |       |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |             |       |       |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN                |             |       |       |  Q1,01 | PCWP |            |
|   6 |       BUFFER SORT             |             |       |       |  Q1,01 | PCWC |            |
|   7 |        PART JOIN FILTER CREATE| :BF0000     |       |       |  Q1,01 | PCWP |            |
|   8 |         PX RECEIVE            |             |       |       |  Q1,01 | PCWP |            |
|   9 |          PX SEND BROADCAST    | :TQ10000    |       |       |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL   | DATE_DIM    |       |       |        |      |            |
|  11 |       PX BLOCK ITERATOR       |             |:BF0000|:BF0000|  Q1,01 | PCWC |            |
|* 12 |        TABLE ACCESS FULL      | STORE_SALES |:BF0000|:BF0000|  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("SS_SOLD_DATE_SK"="D_DATE_SK")
  10 - filter("D_YEAR"=2000)
  12 - access(:Z>=:Z AND :Z<=:Z)

In this execution plan you can see the creation of the bloom filter on line 7, which is populated from the values of D_DATE_SK from DATE_DIM. That bloom filter is then used for partition pruning on the STORE_SALES table, which is why we see :BF0000 in the Pstart/Pstop columns.

Range-Hash Composite Partitioned STORE_SALES

For the next experiment, I kept the same range partitioning scheme but also added hash subpartitioning on the SS_ITEM_SK column (4 hash subpartitions per range partition). STORE_SALES2 has 61 range partitions x 4 hash subpartitions for a total of 244 aggregate partitions.
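The DDL sketch changes only by the addition of a subpartition clause (again, the column list is abbreviated and the boundary values are illustrative):

create table store_sales2 (
  ss_sold_date_sk   number,
  ss_item_sk        number,
  ss_sales_price    number(7,2)
)
parallel 16
partition by range (ss_sold_date_sk)
subpartition by hash (ss_item_sk) subpartitions 4
(
  partition p_199801 values less than (2450846),
  -- ... one partition per calendar month ...
  partition p_200212 values less than (2452641),
  partition p_max    values less than (maxvalue)
);

Let’s look at the execution plan for our test query: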

SQL> select
  2    max(ss_sales_price)
  3  from
  4    store_sales2 ss,
  5    date_dim d
  6  where
  7    ss_sold_date_sk = d_date_sk and
  8    d_year = 2000
  9  ;

MAX(SS_SALES_PRICE)
-------------------
                200

Elapsed: 00:00:41.06

SQL> select * from table(dbms_xplan.display_cursor(format=>'basic +parallel +partition +predicate'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select   max(ss_sales_price) from   store_sales2 ss,   date_dim d where
  ss_sold_date_sk=d_date_sk and   d_year = 2000

Plan hash value: 2496395846

---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name         | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |              |       |       |        |      |            |
|   1 |  SORT AGGREGATE               |              |       |       |        |      |            |
|   2 |   PX COORDINATOR              |              |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10001     |       |       |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |              |       |       |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN                |              |       |       |  Q1,01 | PCWP |            |
|   6 |       BUFFER SORT             |              |       |       |  Q1,01 | PCWC |            |
|   7 |        PART JOIN FILTER CREATE| :BF0000      |       |       |  Q1,01 | PCWP |            |
|   8 |         PX RECEIVE            |              |       |       |  Q1,01 | PCWP |            |
|   9 |          PX SEND BROADCAST    | :TQ10000     |       |       |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL   | DATE_DIM     |       |       |        |      |            |
|  11 |       PX BLOCK ITERATOR       |              |     1 |     4 |  Q1,01 | PCWC |            |
|* 12 |        TABLE ACCESS FULL      | STORE_SALES2 |     1 |   244 |  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("SS_SOLD_DATE_SK"="D_DATE_SK")
  10 - filter("D_YEAR"=2000)
  12 - access(:Z>=:Z AND :Z<=:Z)

Once again you can see the creation of the bloom filter from DATE_DIM on line 7; however, you will notice that we no longer see :BF0000 in the Pstart and Pstop columns. In fact, it may appear that partition pruning is not taking place at all, since we see 1/244 as our Pstart/Pstop values. However, if we compare the execution times between the range and range/hash queries, they are nearly identical, so partition (bloom) pruning must be taking place. After all, if this plan read all 5 years of data it would take roughly 5 times as long as reading just 1 year, and that certainly is not the case. Would you have guessed that partition pruning is taking place had we not worked through the range-only experiment first? Hmmm…

So What Is Going On?

Before we dive in, let’s quickly look at what the execution plans would look like if PX was not used (using serial execution).

--
-- Range Partitioned, Serial Execution
--

---------------------------------------------------------------------
| Id  | Operation                     | Name        | Pstart| Pstop |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |
|   1 |  SORT AGGREGATE               |             |       |       |
|*  2 |   HASH JOIN                   |             |       |       |
|   3 |    PART JOIN FILTER CREATE    | :BF0000     |       |       |
|*  4 |     TABLE ACCESS FULL         | DATE_DIM    |       |       |
|   5 |    PARTITION RANGE JOIN-FILTER|             |:BF0000|:BF0000|
|   6 |     TABLE ACCESS FULL         | STORE_SALES |:BF0000|:BF0000|
---------------------------------------------------------------------
              
--
-- Range-Hash Composite Partitioned, Serial Execution
--
                                       
----------------------------------------------------------------------
| Id  | Operation                     | Name         | Pstart| Pstop |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT              |              |       |       |
|   1 |  SORT AGGREGATE               |              |       |       |
|*  2 |   HASH JOIN                   |              |       |       |
|   3 |    PART JOIN FILTER CREATE    | :BF0000      |       |       |
|*  4 |     TABLE ACCESS FULL         | DATE_DIM     |       |       |
|   5 |    PARTITION RANGE JOIN-FILTER|              |:BF0000|:BF0000|
|   6 |     PARTITION HASH ALL        |              |     1 |     4 |
|   7 |      TABLE ACCESS FULL        | STORE_SALES2 |     1 |   244 |
----------------------------------------------------------------------

When using composite partitioning, pruning is placed on one of the partition iterators. When the two nested partition iterators (range/hash in this case) are collapsed into a block iterator (the PX BLOCK ITERATOR on line 11), we have to pick a “victim” in the query plan tree since only one node in the plan now needs to carry the pruning information (with PX the pruning is really done by the QC, not by the row source as in serial plans). As a result, the information associated with the victimized partition iterator is lost in the explain plan. This is why there is no :BF0000 for Pstart/Pstop in the plan in this case. It is probably more accurate to have the parallel plans for both range and range/hash look like this:

---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name         | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |              |       |       |        |      |            |
|   1 |  SORT AGGREGATE               |              |       |       |        |      |            |
|   2 |   PX COORDINATOR              |              |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10001     |       |       |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |              |       |       |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN                |              |       |       |  Q1,01 | PCWP |            |
|   6 |       BUFFER SORT             |              |       |       |  Q1,01 | PCWC |            |
|   7 |        PART JOIN FILTER CREATE| :BF0000      |       |       |  Q1,01 | PCWP |            |
|   8 |         PX RECEIVE            |              |       |       |  Q1,01 | PCWP |            |
|   9 |          PX SEND BROADCAST    | :TQ10000     |       |       |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL   | DATE_DIM     |       |       |        |      |            |
|  11 |       PX BLOCK ITERATOR       |              |       |       |  Q1,01 | PCWC |            |
|* 12 |        TABLE ACCESS FULL      | STORE_SALES  |:BF0000|:BF0000|  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------------------------

Here the bloom pruning is shown on the TABLE ACCESS FULL row source. This is because there is no Pstart/Pstop for a PX BLOCK ITERATOR row source (it iterates over block ranges, so the partition information is lost; it had been carried by the level above it).

Hopefully this helps you understand and correctly identify execution plans that contain bloom pruning, even though at first glance you may not think they do. If you are uncertain, look at the execution statistics for the query; metrics like the amount of data read and the execution time provide some empirical insight.
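For example, one quick way to get that empirical evidence (a sketch, assuming 11g or later where these V$SQL columns are available, and substituting the two statements’ SQL_IDs) is to compare their I/O and elapsed time directly:

select sql_id,
       round(physical_read_bytes/1024/1024) phys_read_mb,
       round(elapsed_time/1e6, 1)           elapsed_sec,
       executions
from   v$sql
where  sql_id in ('&range_sql_id', '&composite_sql_id');

If both statements read roughly one year’s worth of STORE_SALES rather than all five years, bloom pruning is clearly happening regardless of what the Pstart/Pstop columns suggest.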

Oracle Exadata and Netezza TwinFin Compared – An Engineer’s Analysis

There seems to be little debate that Oracle’s launch of the Oracle Exadata Storage Server and the Sun Oracle Database Machine has created buzz in the database marketplace. Apparently there is so much buzz and excitement around these products that two competing vendors, Teradata and Netezza, have both authored publications that contain a significant amount of discussion about the Oracle Database with Real Application Clusters (RAC) and Oracle Exadata. Both of these vendor papers are well structured but make no mistake, these are marketing publications written with the intent to be critical of Exadata and discuss how their product is potentially better. Hence, both of these papers are obviously biased to support their purpose.

My intent with this blog post is simply to discuss some of the claims, analyze them for factual accuracy, and briefly comment on them. After all, Netezza clearly states in their publication:

The information shared in this paper is made available in the spirit of openness. Any inaccuracies result from our mistakes, not an intent to mislead.

In the interest of full disclosure, my employer is Oracle Corporation, however, this is a personal blog and what I write here are my own ideas and words (see disclaimer on the right column). For those of you who don’t know, I’m a database performance engineer with the Real-World Performance Group which is part of Server Technologies. I’ve been working with Exadata since before it was launched publicly and have worked on dozens of data warehouse proofs-of-concept (PoCs) running on the Exadata powered Sun Oracle Database Machine. My thoughts and comments are presented purely from an engineer’s standpoint.

The following writings are the basis of my discussion:

  1. Teradata: Exadata – the Sequel: Exadata V2 is Still Oracle
  2. Daniel Abadi: Defending Oracle Exadata
  3. Netezza: Oracle Exadata and Netezza TwinFin™ Compared

If you have not read Daniel Abadi’s blog post I strongly suggest you do before proceeding further. I think it is very well written and is presented from a vendor neutral point of view so there is no marketing gobbledygook to sort through. Several of the points in the Teradata writing which he discusses are also presented (or similarly presented) in the Netezza eBook, so you can relate his responses to those arguments as well. Since I feel Daniel Abadi did an excellent job pointing out the major flaws with the Teradata paper, I’m going to limit my discussion to the Netezza eBook.

Understanding Exadata Smart Scan

As a prerequisite for the discussion of the Netezza and Teradata papers, it’s imperative that we take a minute to understand the basics of Exadata Smart Scan. The Smart Scan optimizations include the following:

  • Data Elimination via Storage Indexes
  • Restriction/Row Filtering/Predicate Filtering
  • Projection/Column Filtering
  • Join Processing/Filtering via Bloom Filters and Bloom Pruning

The premise of these optimizations is to reduce query processing times in the following ways (a sketch of how to observe them from session statistics follows the list):

  • I/O Elimination – don’t read data off storage that is not needed
  • Payload Reduction – don’t send data to the Oracle Database Servers that is not needed
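A simple way to see both effects at work (a sketch, assuming an Exadata system and the 11.2 statistic names) is to look at the change in your session’s cell I/O statistics across a query execution:

-- run before and after the query of interest and compare the deltas
select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  ms.statistic# = sn.statistic#
and    sn.name in ('cell physical IO bytes eligible for predicate offload',
                   'cell physical IO interconnect bytes returned by smart scan');

A large gap between the bytes eligible for offload and the bytes returned over the interconnect is the payload reduction at work.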

OK. Now that you have a basic understanding, let’s dive into the claims…

Netezza’s Claims

Let’s discuss a few of Netezza claims against Exadata:

Claim: Exadata Smart Scan does not work with index-organized tables or clustered tables.

While this is a true statement, its intent is clearly to mislead you. Both of these structures are designed for OLTP workloads, not data warehousing. In fact, if one were to actually read the Oracle Database 11.2 documentation for index-organized tables you would find the following (source):

Index-organized tables are ideal for OLTP applications, which require fast primary key access

If one were to research table clusters you would find the Oracle Database 11.2 documentation offers the following guidelines (source):

Typically, clustering tables is not appropriate in the following situations:

  • The tables are frequently updated.
  • The tables frequently require a full table scan.
  • The tables require truncating.

As anyone can see from the Oracle Database 11.2 Documentation, neither of these structures are appropriate for data warehousing.

Apparently this was not what Netezza really wanted you to know, so they dug up a note on IOTs from almost a decade ago – 2001, the Oracle 9i time frame – which, while it clearly states:

[an IOT] enables extremely fast access to table data for primary key based [OLTP] queries

it also suggests that an IOT may be used as a fact table. Clearly this information is quite old and outdated and should probably be removed. What was a recommendation for Oracle Database 9i Release 1 in 2001 is not necessarily a recommendation for Oracle Database 11g Release 2 in 2010. Technology changes so using the most recent recommendations as a basis for discussion is appropriate, not some old, outdated stuff from nearly 10 years ago. Besides, the Oracle Database Machine runs version 11g Release 2, not 9i Release 1.

Bottom line: I’d say this “limitation” has an impact on a nice round number of Exadata data warehouse customers – exactly zero (zero literally being a round number). IOTs and clustered tables are both structures optimized for fast primary key access, like the type of access in OLTP workloads, not data warehousing. The argument that Smart Scan does not work for these structures is really no argument at all.

Claim: Exadata Smart Scan does not work with the TIMESTAMP datatype.

Phil Francisco seems to have left out some very important context in making this accusation, because this is not at all what the cited blog post by Christian Antognini discusses. This post clearly states the discussion is about:

What happens [with Smart Scan] when predicates contain functions or expressions?

Nowhere at all does that post make an isolated reference that Smart Scan does not work with the TIMESTAMP datatype. What this blog post does state is this:

when a TIMESTAMP datatype is involved [with datetime functions], offloading almost never happens

While the Netezza paper references what the blog post author has written, some very important context has been omitted. In doing so, Netezza has taken a specific reference and turned it into a misleading generalization.

The reality is that Smart Scan does indeed work for the TIMESTAMP datatype and here is a basic example to demonstrate such:

SQL> describe t
 Name           Null?    Type
 -------------- -------- ------------------
 ID             NOT NULL NUMBER
 N                       NUMBER
 BF                      BINARY_FLOAT
 BD                      BINARY_DOUBLE
 D                       DATE
 T                       TIMESTAMP(6)
 S                       VARCHAR2(4000)    

SQL> SELECT * FROM t WHERE t = to_timestamp('01-01-2010','DD-MM-YYYY');

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

----------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |     1 |    52 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| T    |     1 |    52 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("T"=TIMESTAMP' 2010-01-01 00:00:00.000000000')
       filter("T"=TIMESTAMP' 2010-01-01 00:00:00.000000000')

You can see that the Smart Scan offload is taking place by the presence of the storage() clause in the Predicate Information section above. What Christian Antognini did observe is bug 9682721; the bugfix resolves the datetime function offload issues for all but a couple of scenarios (which he blogs about here), and those operations can be (and usually are) expressed differently. For example, an expression that applies a datetime function such as ADD_MONTHS() to the column can easily be expressed as a simple range (BETWEEN or an equivalent >=/< pair) on the bare column.
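For instance, here is a hedged sketch of such a rewrite, using the T column (a TIMESTAMP) from the table described above and an arbitrary one-month window:

-- A predicate that applies a datetime function to the column (may not be offloaded):
select count(*) from t
where  trunc(t, 'MM') = date '2010-01-01';

-- The same one-month window expressed against the bare column, with any functions
-- applied to the literals instead (offloadable):
select count(*) from t
where  t >= date '2010-01-01'
and    t <  add_months(date '2010-01-01', 1);

The half-open range is used here rather than BETWEEN so that rows from the very first instant of the following month are not picked up.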

Bottom line: Exadata Smart Scan does work with the TIMESTAMP datatype.

Claim: When transactions (insert, update, delete) are operating against the data warehouse concurrent with query activity, smart scans are disabled. Dirty buffers turn off smart scan.

Yet again, Netezza presents only a half-truth. While it is true that an active transaction disables Smart Scan, they fail to further clarify that Smart Scan is only disabled for those blocks that contain an active transaction – the rest of the blocks are able to be Smart Scanned. The amount of data that is impacted by insert, update, delete will generally be a very small fraction of the total data in a data warehouse. Also, data that is inserted via direct path operations is not subject to MVCC (the method Oracle uses for read consistency) as the blocks that are used are new blocks so no read consistent view is needed.

Bottom line: While this claim is partially true, it clearly attempts to overstate the impact of this scenario in a very negative way. Not having Smart Scan for a small number of blocks will have a negligible impact on performance.

Also see Daniel Abadi: Exadata does NOT Support Active Data Warehousing

Claim: Using [a shared-disk] architecture for a data warehouse platform raises concern that contention for the shared resource imposes limits on the amount of data the database can process and the number of queries it can run concurrently.

It is unclear what resource Netezza is referring to here; the paper simply states “the shared resource”. You know the one? Yes, that one… Perhaps they mean the disks themselves, but that is completely unknown. Anyway…

Exadata uses at least a 4 MB Automatic Storage Management (ASM) allocation unit (AU) [more on ASM basics]. This means that there is at least 4 MB of contiguous physical data laid out on the HDD, which translates into 4 MB of contiguous data streamed off of disk for full table scans before the head needs to perform a seek. With such large I/O requests the HDDs are able to spend nearly all their time transferring data and very little time finding it, and that is what matters most. Clearly if Exadata is able to stream data off of disk at 125 MB/s per disk (near physics speed for this type of workload) then any alleged “contention” is really not an issue. In many multi-user data warehouse workloads for PoCs, I’ve observed that each Exadata Storage Server is able to perform very close to or at the data sheet physical HDD I/O rate of 1500 MB/s per server.
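As a quick sanity check on those numbers (assuming the 12 hard disks per Exadata Storage Server listed on the V2 data sheet):

   12 disks per cell x 125 MB/s per disk = 1500 MB/s per storage server

which lines up with the data sheet scan rate referenced above, leaving very little alleged “contention” to find.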

Bottom line: The scalability differences between shared-nothing and shared-disk are very much exaggerated. By doing large sequential I/Os the disk spends its time returning data, not finding it. Simply put – there really is no “contention”.

Also see Daniel Abadi: 1) Exadata does NOT Enable High Concurrency & 2) Exadata is NOT Intelligent Storage; Exadata is NOT Shared-Nothing

Claim: Analytical queries, such as “find all shopping baskets sold last month in Washington State, Oregon and California containing product X with product Y and with a total value more than $35” must retrieve much larger data sets, all of which must be moved from storage to database.

I find it so ironic that Netezza mentions this type of query, as a nearly identical (but more complex) one was used by my group at Oracle OpenWorld 2009 in our The Terabyte Hour with the Real-World Performance Group session. The exact analytical query we ran live for the audience to demonstrate the features of Oracle Exadata and the Oracle Database Machine was, “What were the most popular items in the baskets of shoppers who visited stores in California in the first week of May and did not buy bananas?”

Let’s translate the Netezza analytical question into some tables and SQL to see what the general shape of this query may look like:

select
   count(*)  -- using count(*) for simplicity of the example
from (
   select
      td.transaction_id,
      sum(td.sales_dollar_amt) total_sales_amt,
      sum(case when p.product_description in ('Brand #42 beer') then 1 else 0 end) count_productX,
      sum(case when p.product_description in ('Brand #42 frozen pizza') then 1 else 0 end) count_productY
   from transaction_detail td
      join d_store s   on (td.store_key = s.store_key)
      join d_product p on (td.product_key = p.product_key)
   where
      s.store_state in ('CA','OR','WA') and
      td.transaction_time >= timestamp '2010-07-01 00:00:00' and
      td.transaction_time <  timestamp '2010-08-01 00:00:00'
   group by td.transaction_id
) x
where
   total_sales_amt > 35 and
   count_productX > 0 and
   count_productY > 0

To me, this isn’t a particularly complex analytical question/query. As written, it’s just a 3 table join (could be 4 if I added a D_DATE I suppose), but it doesn’t require anything fancy – just a simple GROUP BY with a CASE in the SELECT to count how many times Product X and Product Y appear in a given basket.

Netezza claims that analytical queries like this one must move all the data from storage to the database, but that simply is not true. Here is why:

  1. Simple range partitioning on the event timestamp (a very common data warehousing practice for those databases that support partitioning), or even Exadata Storage Indexes, will eliminate any I/O for data other than the one month window that is required for this query.
  2. A bloom filter can be created and pushed into Exadata to be used as a storage filter for the list of STORE_KEY values that represent the three state store restriction.

Applying both #1 and #2, the only fact table data returned to the database is the set of rows for stores in Washington State, Oregon and California for last month. Clearly this is only a subset of the entire fact table.

This is just one example, but there are obviously different representations of the same data and query that could be used. I chose what I thought was the most raw, unprocessed, uncooked form simply because Netezza seems to boast about brute force type of operations. Even then, considering a worst case scenario, Exadata does not have to move all the data back to the database. Other data/table designs that I’ve seen from customers in the retail business would allow even less data to be returned.

Bottom line: There are numerous ways that Exadata can restrict the data that is sent to the database servers, and it’s likely that any query with any predicate restrictions can do so. Certainly it is possible even with the analytic question that Netezza mentions.

Claim: To evenly distribute data across Exadata’s grid of storage servers requires administrators trained and experienced in designing, managing and maintaining complex partitions, files, tablespaces, indices, tables and block/extent sizes.

Interestingly enough, the author of the Teradata paper seems to have a better grasp than Netezza on how data distribution and ASM work, describing it on page 9:

Distribution of data on Exadata storage is managed by Oracle’s Automatic Storage Manager (ASM). By default, ASM stripes each Oracle data partition across all available disks on every Exadata cell.

So if, by default, ASM evenly stripes data across all available disks on every Exadata Storage Server (and it does, in a round-robin manner), what exactly is so difficult here? What training and experience is really required for something that does data distribution automatically? I can only assume that Phil Francisco has not read the Teradata paper (though it would seem he has, since he even mentions it on his blog), let alone Introduction to Oracle Automatic Storage Management. It’s claims like this that really make me question how genuine his “no intent to mislead” statement really is.

Bottom line: Administrators need not worry about data distribution with Exadata and ASM – it is done automatically and evenly for you.

Conclusion

I’m always extremely reluctant to believe much of what vendors say about other vendors, especially when they preface their publication with something like: “One caveat: Netezza has no direct access to an Exadata machine”, and “Any inaccuracies result from our mistakes, not an intent to mislead”, yet they still feel qualified enough to write about said technology and claim it as fact. I also find it interesting that both Teradata and Netezza have published anti-Exadata papers, but neither Netezza nor Teradata has published anti-vendor papers about each other (that I know of). Perhaps Exadata is much more of a competitor than either of them let on. They do protest too much, methinks.

The list of claims I’ve discussed certainly is not an exhaustive list by any means but I think it is fairly representative of the quality found in Netezza’s paper. While sometimes the facts are correct, the arguments are overstated and misleading. Other times, the facts are simply wrong. Netezza clearly attempts to create the illusion of problems simply where they do not exist.

Hopefully this blog post has left you a more knowledgeable person when it comes to Oracle and Exadata. I’ve provided fact and example wherever possible and kept assertions to a minimum.

I’d like to end with a quote from Daniel Abadi’s response to the Teradata paper which I find more than applicable to the Netezza paper as well:

Many of the claims and inferences made in the paper about Exadata are overstated, and the reader needs to be careful not to be misled into believing in the existence of problems that don’t actually present themselves on realistic data sets and workloads.

Courteous and professional comments are welcome. Anonymous comments are discouraged. Snark and flame will end up in the recycle bin. Comment accordingly.

The Core Performance Fundamentals Of Oracle Data Warehousing – Set Processing vs Row Processing

[back to Introduction]

In over six years of doing data warehouse POCs and benchmarks for clients there is one area that I frequently see as problematic: “batch jobs”.  Most of the time these “batch jobs” take the form of some PL/SQL procedures and packages that generally perform some data load, transformation, processing or something similar.  The reason these are so problematic is that developers have hard-coded “slow” into them.  I’m generally certain these developers didn’t know they had done this when they coded their PL/SQL, but none the less it happened.

So How Did “Slow” Get Hard-Coded Into My PL/SQL?

Generally “slow” gets hard-coded into PL/SQL because the PL/SQL developer(s) took the business requirements and did a “literal translation” of each rule/requirement one at a time instead of looking at the “before picture” and the “after picture” and determining the most efficient way to make those data changes.  Many times this can surface as cursor based row-by-row processing, but it also can appear as PL/SQL just running a series of often poorly thought out SQL commands.

Hard-Coded Slow Case Study

The following is based on a true story. Only the names have been changed to protect the innocent.

Here is a pseudo code snippet based on a portion of some data processing I saw in a POC:

{truncate all intermediate tables}
insert into temp1 select * from t1 where create_date = yesterday;
insert into temp1 select * from t2 where create_date = yesterday;
insert into temp1 select * from t3 where create_date = yesterday;
insert into temp2 select * from temp1 where {some conditions};
insert into target_table select * from temp2;
for each of 20 columns
loop
  update target_table t
    set t.column_name =
      (select column_name
       from t4
       where t.id=t4.id )
    where t.column_name is null;
end loop
update target_table t set {list of 50 columns} = (select {50 columns} from t5 where t.id=t5.id);

I’m going to stop there as any more of this will likely make you cry more than you already should be.

I almost hesitate to ask the question, but isn’t it quite obvious what is broken about this processing?  Here are the major inefficiencies as I see them:

  • What is the point of inserting all the data into temp1, only to then filter some of it out when temp2 is populated?  If you haven’t heard the phrase “filter early” you have some homework to do.
  • Why publish into the target_table and then perform 20 single column updates, followed by a single 50 column update?  Better question yet: Why perform any bulk updates at all?  Bulk updates (and deletes) are simply evil – avoid them at all costs.

So, as with many clients that come in and do an Exadata Database Machine POC, they really weren’t motivated to make any changes to their existing code; they just wanted to see how much performance the Exadata platform would give them.  Much to their happiness, this reduced their processing time from over 2.5 days (weekend job that started Friday PM but didn’t finish by Monday AM) down to 10 hours, a savings of over 2 days (24 hours).  Now, it could fail and they would have time to re-run it before the business opened on Monday morning.  Heck, I guess if I got back 24 hours out of 38 I’d be excited too, if I were not a database performance engineer who knew there was even more performance left on the table, waiting to be exploited.

Feeling unsatisfied, I took it upon myself to demonstrate the significant payback that re-engineering can yield on the Exadata platform, so I coded up an entirely new set-based data flow in just a handful of SQL statements (no PL/SQL).  The result: processing an entire week’s worth of data (several hundred million rows) now took just 12 minutes.  That’s right — 7 days’ worth of events scrubbed, transformed, enriched and published in just 12 minutes.
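To give a feel for the general shape of that rewrite, here is a heavily simplified sketch (table and column names mirror the pseudo code above and are purely illustrative, not the client’s actual code):

insert /*+ append */ into target_table
select
   s.id,
   nvl(s.col_01, t4.col_01) col_01,  -- each former single-column update becomes an NVL against t4
   -- ... likewise for the other columns formerly updated from t4 ...
   t5.col_21,                        -- the 50-column update becomes a simple join to t5
   -- ... likewise for the other columns formerly updated from t5 ...
   s.other_cols
from (
      select * from t1 where create_date = :yesterday and {some conditions}
      union all
      select * from t2 where create_date = :yesterday and {some conditions}
      union all
      select * from t3 where create_date = :yesterday and {some conditions}
     ) s
     left outer join t4 on (s.id = t4.id)
     left outer join t5 on (s.id = t5.id);

One direct-path insert reads each source table once, applies the filters as early as possible, and publishes the finished rows: no intermediate tables and no bulk updates.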

When I gently broke the news to this client that it was possible to load the week’s events in just 12 minutes they were quite excited (to say the least).  In fact, one person even said (a bit out of turn), “well, that would mean that a single day’s events could be loaded in just a couple minutes and that would give a new level of freshness to the data which would allow the business to make faster, better decisions due to the timeliness of the data.”  My response: “BINGO!”  This client had just had the epiphany of what is now possible with Exadata where previously it was impossible.

It’s Not a Need, It’s a Want

I’m not going to give away my database engineer hat for a product marketing hat just yet (or probably ever), but this is the reality that exists.  IT shops started with small data sets and used small data set programming logic on their data, and that worked for some time.  The reason: inefficient processing on a small data set is only a little inefficient, but the same processing logic on a big data set is very inefficient.  This is why I have said before: In order to fully exploit the Oracle Exadata platform (or any current day platform) some re-engineering may be required. Do not be mistaken — I am not saying you need to re-engineer your applications for Exadata.  I am saying you will want to re-engineer your applications for Exadata, as those applications simply were not designed to leverage the massively parallel processing that Exadata allows one to do.  It’s time to base design decisions on today’s technology, not on what was available when your application was designed.  Fast forward to now.

The Core Performance Fundamentals Of Oracle Data Warehousing – Data Loading

[back to Introduction]

Getting flat file data into your Oracle data warehouse is likely a daily (or possibly more frequent) task, but it certainly does not have to be a difficult one.  Bulk loading data rates are governed by the following operations and hardware resources:

  1. How fast can the data be read
  2. How fast can data be written out
  3. How much CPU power is available

I’m always a bit amazed (and depressed) when I hear people complain that their data loading rates are slow and they proceed to tell me things like:

  • The source files reside on a shared NFS filer (or similar) and it has just a single GbE (1 Gigabit Ethernet) network path to the Oracle database host(s).
  • The source files reside on this internal disk volume which consists of a two disk mirror (or a volume with very few spindles).

Maybe it’s not entirely obvious so let me spell it out (as I did in this tweet):

One can not load data into a database faster than it can be delivered from the source. Database systems must obey the laws of physics!

Or putting it another way: Don’t fall victim to slow data loading because of a slow performing data source.

Given a system that can provide data at fast enough rates, the loading rate becomes a function of points #2 and #3. The database operations can be simplified to:

  1. Read lines from flat files
  2. Process lines into columns, internal data types and optionally compress
  3. Write rows/columns out to disk

In most cases a reasonably sized system becomes CPU bound, not write bound, on data loads, as almost all Oracle data warehouses use compression, which increases CPU consumption but reduces the IO requirement for the writes.  Or putting it another way:  Bulk loading into a compressed table should be a CPU bound operation, not a disk (write) bound operation.

Data Loading Best Practices (What To Do and Why To Do It)

Oracle offers two methods to load data from flat files: 1) SQL*Loader and 2) External Tables.  I would highly recommend that bulk loads (especially PDML loads) be done via External Tables and that SQL*Loader only be used for non-parallel loads (PARALLEL=false) with small amounts of data (not bulk loads).  The high level reason for this recommendation is that External Tables have nearly all the SQL functionality of a heap table and allow many more optimizations than SQL*Loader does, and there are some undesirable side effects (mostly in the space management layer) from using PARALLEL=true with SQL*Loader.

In order to avoid the reading of the flat files becoming the bottleneck, use a filesystem that is backed by numerous spindles (more than enough to provide the desired loading rate) and consider using compressed files in conjunction with the external table preprocessor.  Using the preprocessor is especially useful if there are proportionally more CPU resources on the database system than network or disk bandwidth, because compression of the source files allows for a larger logical row “delivery” for a given file size.  Something else that may not be obvious is to put the flat files on a filesystem that is mounted using directio mount options.  This will prevent the file system cache from being flooded with data that will (likely) never be read again (how many times do you load the same files?).  Another option, available starting with Oracle 11.2, is DBFS (Database File System), a fast and scalable solution for staging flat files that is frequently used with the Oracle Database Machine and Exadata.
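To make that concrete, here is a hedged sketch of an external table over gzip’d flat files using the preprocessor (the directory paths, file names and column list are all illustrative):

create or replace directory data_dir as '/data/flat_files';
create or replace directory exec_dir as '/usr/bin';

create table et_sales_ext (
  sale_id   number,
  cust_id   number,
  amount    number
)
organization external (
  type oracle_loader
  default directory data_dir
  access parameters (
    records delimited by newline
    preprocessor exec_dir:'zcat'
    fields terminated by '|'
    missing field values are null
  )
  location ('sales_01.dat.gz', 'sales_02.dat.gz', 'sales_03.dat.gz')
)
parallel
reject limit unlimited;

The trade-off: a bit of extra CPU on the database nodes to decompress, in exchange for moving (and staging) considerably fewer bytes from the source filesystem.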

In order to achieve the best loading speeds be sure to:

  • Use External Tables
  • Use a staging filesystem (and network) fast enough to meet your loading speed requirements (and consider directio mount options)
  • Use Parallel Execution (parallel CTAS or PDML INSERT)
  • Use direct-path loads (nologging CTAS or INSERT /*+ APPEND */)
  • Use a large enough initial/next extent size (8MB is usually enough)

If you follow these basic recommendations you should be able to achieve loading speeds that easily meet your requirements (otherwise you likely just need more hardware).

Loading Data At Ludicrous Speed

I’ve yet to come across a reasonably designed system that is capable of becoming write bound, as systems simply either 1) do not have enough CPU to do so or 2) are unable to read the source flat files anywhere near fast enough to do so.  I have, however, conducted experiments to test write throughput of a Sun Oracle Database Machine (Exadata V2) by using flat files cached completely in the filesystem cache and referencing them numerous times in the External Table DDL. The results should be quite eye opening for many, especially those who think the Oracle database can not load data fast.  Loading into an uncompressed table, I was able to load just over 1TB of flat file data (over 7.8 billion rows) in a mere 4.6 minutes (275 seconds).  This experiment does not represent typical loading speed rates as it’s unlikely the source files are on a filesystem as fast as main memory, but it does demonstrate that if the flat file data could be delivered at such rates, the Oracle software and hardware can easily load it at close to physics speed (the max speed the hardware is capable of).

SQL> create table fastload
  2  pctfree 0
  3  parallel
  4  nologging
  5  nocompress
  6  storage(initial 8m next 8m)
  7  tablespace ts_smallfile
  8  as
  9  select * from et_fastload;

Table created.

Elapsed: 00:04:35.49

SQL> select count(*) from fastload;

     COUNT(*)
-------------
7,874,466,950

Elapsed: 00:01:06.54

SQL> select ceil(sum(bytes)/1024/1024) mb from user_segments where segment_name='FASTLOAD';

       MB
---------
1,058,750

SQL> exec dbms_stats.gather_table_stats(user,'FASTLOAD');

PL/SQL procedure successfully completed.

SQL> select num_rows,blocks,avg_row_len from user_tables where table_name='FASTLOAD';

  NUM_ROWS     BLOCKS AVG_ROW_LEN
---------- ---------- -----------
7874466950  135520008         133

Just so you don’t think I’m making this stuff up, check out the SQL Monitor Report for the execution, noting the IO throughput graph from the Metrics tab (10GB/s write throughput isn’t half bad).

So as you can see, flat file data loading has really become more of a data delivery problem than a data loading problem.  If the Oracle Database, specifically the Exadata powered Oracle Database Machine, can bulk load data from an external table whose files reside in the filesystem cache at a rate of 13TB per hour (give or take), you probably don’t have to worry too much about meeting your data loading rate business requirements (wink).

Note: Loading rates will vary slightly depending on table definition, number of columns, data types, compression type, etc.

References

The Core Performance Fundamentals Of Oracle Data Warehousing – Parallel Execution

[back to Introduction]

Leveraging Oracle’s Parallel Execution (PX) in your Oracle data warehouse is probably the most important feature/technology one can use to speed up operations on large data sets.  PX is not, however, “go fast” magic pixie dust for any old operation (if that’s what you think, you probably don’t understand the parallel computing paradigm). With Oracle PX, a large task is broken up into smaller parts, sub-tasks if you will, and each sub-task is then worked on in parallel.  The goal of Oracle PX: divide and conquer.  This allows a significant amount of hardware resources to be engaged in solving a single problem and is what allows the Oracle database to scale up and out when working with large data sets.

I thought I’d touch on some basics and add my observations, but this is by no means an exhaustive write-up on Oracle’s Parallel Execution.  There is an entire chapter in the Oracle Database documentation on PX as well as several white papers.  I’ve listed all these in the Resources section at the bottom of this post.  Read them, but as always, feel free to post questions/comments here.  Discussion adds great value.

A Basic Example of Parallel Execution

Consider a simple one table query like the one below.
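A minimal sketch of such a query and the general shape of its parallel plan (the SALES table and DOP are assumed purely for illustration):

select /*+ parallel(s, 8) */ count(*) from sales s;

-- General plan shape (abridged):
--   SORT AGGREGATE
--     PX COORDINATOR
--       PX SEND QC (RANDOM)      :TQ10000
--         SORT AGGREGATE
--           PX BLOCK ITERATOR
--             TABLE ACCESS FULL  SALES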

You can see that the PX Coordinator (also known as the Query Coordinator or QC) breaks up the “work” into several chunks and those chunks are worked on by the PX Server Processes.  The technical term for the chunk a PX Server Process works on is called a granule.  Granules can either be block-based or partition-based.

When To Use Parallel Execution

PX is a key component in data warehousing as that is where large data sets usually exist.  The most common operations that use PX are queries (SELECTs) and data loads (INSERTs or CTAS).  PX is most commonly controlled by using the PARALLEL attribute on the object, although it can be controlled by hints or even Oracle’s Database Resource Manager.  If you are not using PX in your Oracle data warehouse then you are probably missing out on a shedload of performance opportunity.

When an object has its PARALLEL attribute set or the PARALLEL hint is used, queries will leverage PX, but to leverage PX for DML operations (INSERT/DELETE/UPDATE) remember to alter your session by using the command:

alter session [enable|force] parallel dml;
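For example, a complete direct-path parallel load might look like this (table names and DOP are illustrative):

alter session enable parallel dml;

insert /*+ append parallel(sales, 16) */ into sales
select /*+ parallel(stage_sales, 16) */ *
from   stage_sales;

commit;

Without the ALTER SESSION, only the query side of such a statement runs in parallel; the insert itself would be performed serially.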

Do Not Fear Parallel Execution

Since Oracle’s PX is designed to take advantage of multiple CPUs (or CPU cores) at a time, it can leverage significant hardware resources, if available.  From my experiences in talking with Oracle DBAs, the ability for PX to do this scares them.  This results in DBAs implementing a relatively small degree of parallelism (DOP) for a system that could possibly support a much higher level (based on #CPUs).  Oftentimes, though, the system that PX is being run on is not a balanced system and frequently has much more CPU power than disk and channel bandwidth, so data movement from disk becomes the bottleneck well before the CPUs are busy.  This results in many statements like “Parallel Execution doesn’t work” or similar, because the user/DBA isn’t observing a decrease in execution time with more parallelism.  Bottom line:  if the hardware resources are not available, the software certainly cannot scale.

Just for giggles (and education), here is a snippet from top(1) from one node of an Oracle Database Machine running a single query (across all 8 database nodes) at DOP 256.

top - 20:46:44 up 5 days,  3:48,  1 user,  load average: 36.27, 37.41, 35.75
Tasks: 417 total,  43 running, 373 sleeping,   0 stopped,   1 zombie
Cpu(s): 95.6%us,  1.6%sy,  0.0%ni,  2.2%id,  0.0%wa,  0.2%hi,  0.4%si,  0.0%st
Mem:  74027752k total, 21876824k used, 52150928k free,   440692k buffers
Swap: 16771852k total,        0k used, 16771852k free, 13770844k cached

USER       PID  PR  NI  VIRT  SHR  RES S %CPU %MEM    TIME+  COMMAND
oracle   16132  16   0 16.4g 5.2g 5.4g R 63.8  7.6 709:33.02 ora_p011_orcl
oracle   16116  16   0 16.4g 4.9g 5.1g R 60.9  7.2 698:35.63 ora_p003_orcl
oracle   16226  15   0 16.4g 4.9g 5.1g R 59.9  7.2 702:01.01 ora_p028_orcl
oracle   16110  16   0 16.4g 4.9g 5.1g R 58.9  7.2 697:20.51 ora_p000_orcl
oracle   16122  15   0 16.3g 4.9g 5.0g R 56.9  7.0 694:54.61 ora_p006_orcl

(Quite the TIME+ column there, huh!)

Summary

In this post I’ve been a bit light on the technicals of PX, but that is mostly because 1) this is a fundamentals post and 2) there is a ton more detail in the referenced documentation and I really don’t feel like republishing what already exists. Bottom line, Oracle Parallel Execution is a must for scaling performance in your Oracle data warehouse.  Take the time to understand how to leverage it to maximize performance in your environment and feel free to start a discussion here if you have questions.

References

The Core Performance Fundamentals Of Oracle Data Warehousing – Partitioning

[back to Introduction]

Partitioning is an essential performance feature for an Oracle data warehouse because partition elimination (or partition pruning) generally eliminates a significant amount of table data from being scanned. This results in a need for less system resources and improved query performance. Someone once told me “the fastest I/O is the one that never happens.” This is precisely the reason that partitioning is a must for Oracle data warehouses – it’s a huge I/O eliminator. I frequently refer to partition elimination as the anti-index. An index is used to find a small amount of data that is required; partitioning is used to eliminate vast amounts of data that is not required.

Main Uses For Partitioning

I would classify the main reasons to use partitioning in your Oracle data warehouse into these four areas:

  • Data Elimination
  • Partition-Wise Joins
  • Manageability (Partition Exchange Load, Local Indexes, etc.)
  • Information Lifecycle Management (ILM)

Partitioning Basics

The most common partitioning design pattern found in Oracle data warehouses is to partition the fact tables by range (or interval) on the event date/time column. This allows for partition elimination of all the data not in the desired time window in queries. For example: If I have a fact table that contains point of sale (POS) data, each line item for a given transaction has a time stamp of when the item was scanned. Let’s say this value is stored in column EVENT_TS which is a DATE or TIMESTAMP data type. In most cases it would make sense to partition by range on EVENT_TS using one day partitions. This means every query that uses a predicate filter on EVENT_TS (which should be nearly every one) can eliminate significant amounts of data that is not required to satisfy the query predicate.  If you want to look at yesterday’s sales numbers, there is no need to bring back rows from last week or last month!
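A hedged sketch of what that might look like, using interval partitioning so that new daily partitions are created automatically (table and column names are illustrative):

create table pos_line_item (
  event_ts      date          not null,
  store_key     number        not null,
  product_key   number        not null,
  quantity      number,
  sales_amount  number(12,2)
)
partition by range (event_ts)
interval (numtodsinterval(1, 'DAY'))
(
  partition p_initial values less than (date '2010-01-01')
);

A predicate such as event_ts >= date '2010-07-09' and event_ts < date '2010-07-10' then touches exactly one day’s partition.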

Subpartitioning Options

Depending on the schema design of your data warehouse you may also choose to subpartition a table. This allows one to further segment a table to allow for even more data elimination, or it can allow for partition-wise joins, which reduce the usage of CPU and memory resources by minimizing the amount of data exchanged between parallel execution server processes. In third normal form (3NF) schemas it is very beneficial to use hash partitioning or subpartitioning to allow for partition-wise joins (see Oracle Parallel Execution: Interconnect Myths And Misunderstandings) for this exact reason. Dimensional models (star schemas) may also benefit from hash subpartitioning and partition-wise joins. Generally it is best to hash subpartition on a join key column to a very large dimension, like CUSTOMER, so that a partition-wise join will be used between the fact table and the large dimension table.
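A hedged sketch of what that might look like (object names and the subpartition count are illustrative; the count would typically be a power of two sized to the system’s DOP):

create table sales_fact (
  event_ts      date   not null,
  customer_key  number not null,
  sales_amount  number(12,2)
)
partition by range (event_ts)
subpartition by hash (customer_key) subpartitions 16
(
  partition p_2010_07 values less than (date '2010-08-01'),
  partition p_2010_08 values less than (date '2010-09-01')
);

If the CUSTOMER dimension is hash partitioned on CUSTOMER_KEY with the same number of hash buckets, the join between the fact table and CUSTOMER can be executed as a partition-wise join.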

Manageability

Managing large objects can be challenging for a variety of reasons which is why Oracle Partitioning allows for many operations to be done at a global or partition (or subpartition) level.  This makes it much easier to deal with tables or indexes of large sizes.  It also is transparent to applications so the SQL that runs against a non-partitioned object will run as-is against a partitioned object.  Some of the key features include:

  • Partition Exchange Load – Data can be loaded “out of line” and exchanged into a partitioned table (see the sketch after this list).
  • Local Indexes – It takes much less time to build local indexes than global indexes.
  • Compression – Can be applied at the segment level so it’s possible to have a mix of compressed and non-compressed partitions.
  • Segment Moves/Rebuilds/Truncates/Drops – Each partition (or subpartition) is a segment and can be operated on individually and independently of the other partitions in the table.
  • Information Lifecycle Management (ILM) – Partitioning allows implementation of an ILM strategy.
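A hedged sketch of a partition exchange load (all object names are assumptions for illustration; SALES here is a simple range-partitioned table with a P_2010_07 partition for July):

-- Load "out of line" into a standalone table of the same shape as SALES
create table sales_stage as
select * from sales where 1 = 0;

insert /*+ append */ into sales_stage
select * from ext_sales_july;   -- e.g. an external table over July's flat files
commit;

-- Swap the loaded table in as the July partition: a near-instant metadata operation
alter table sales
  exchange partition p_2010_07
  with table sales_stage
  without validation;           -- skip row-by-row boundary checking if the load is trusted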

Summary

I’d classify partitioning as a “must have” for Oracle data warehouses for both the performance and manageability reasons described above. Partitioning should lower query response time as well as resource utilization due to “smart” data access (only go after the data the query needs). There are additional partitioning design patterns as well, and the Oracle documentation contains descriptions of them along with examples.

Oracle Documentation References: