Category: Data Warehousing

Facebook: Hive – A Petabyte Scale Data Warehouse Using Hadoop

Today, June 10th, marks the Yahoo! Hadoop Summit ’09 and the crew at Facebook have a writeup on the Facebook Engineering page entitled: Hive – A Petabyte Scale Data Warehouse Using Hadoop.

I found this an very interesting read given some of the Hadoop/MapReduce comments from David J. DeWitt and Michael Stonebraker as well as their SIGMOD 2009 paper, A Comparison of Approaches to Large-Scale Data Analysis. Now I’m not about to jump into this whole dbms-is-better-than-mapreduce argument but I found Facebook’s story line interesting:

When we started at Facebook in 2007 all of the data processing infrastructure was built around a data warehouse built using a commercial RDBMS. The data that we were generating was growing very fast – as an example we grew from a 15TB data set in 2007 to a 2PB data set today. The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to process and the situation was just getting worse with every passing day. We had an urgent need for infrastructure that could scale along with our data and it was at that time we then started exploring Hadoop as a way to address our scaling needs.

[The] Hive/Hadoop cluster at Facebook stores more than 2PB of uncompressed data and routinely loads 15 TB of data daily

Wow, 2PB of uncompressed data and growing at around 15TB daily. A part of me wonders how much value there is in 2PB of data or if companies are suffering from OCD when it comes to data. Either way it’s interesting to see how much data is being generated/collected and how engineers are dealing with it.

Oracle And HP Take Back #1 Spot For 1TB TPC-H Benchmark

Oracle and HP have taken back the #1 spot by setting a new performance record in the 1TB TPC-H benchmark. The HP/Oracle result puts the Oracle database ahead of both the Exasol (currently #2 & #3) and ParAccel (currently #4) results in the race for performance at the 1TB scale factor and places Oracle in the >1 million queries per hour (QphH) club, which is no small achievement. Compared to the next best result from HP/Oracle (currently #5), this result has over 9X the query throughput (1,166,976 QphH vs. 123,323 QphH) at around 1/4 the cost (5.42 USD vs. 20.54 USD) demonstrating significantly more performance for the money.

Some of the interesting bits from the hardware side:

  • 4 HP BladeSystem c7000 Enclosures
  • 64 HP ProLiant BL460c Servers
  • 128 Quad-Core Intel Xeon X5450 “Harpertown” Processors (512 cores)
  • 2TB Total System Memory (RAM)
  • 6 HP Oracle Exadata Storage Servers

As you can see, this was a 64 node Oracle Real Application Cluster (RAC), each node having 2 processors (8 cores). This is also the first TPC-H benchmark from Oracle that used Exadata as the storage platform.

Congratulation to the HP/Oracle team on the great accomplishment!

Transaction Processing Performance Council_1244094205417.png

Intel Nehalem-EP Xeon 5500 Series Processors Makes Databases Go 2X Faster

As a database performance engineer there are certain things that get me really excited. One of them is hardware. Not just any hardware, but the latest, greatest, bleeding edge stuff. It is especially exciting when the latest generation of CPUs are twice as fast as the previous generation, and those being no slouch. This is how Intel’s new Nehalem-EP Xeon 5500 series processors are shaping up.

The big launch was on March 30th so in the past few days all the benchmark reports and blog posts have been rolling in. Here are a few that I think are worth highlighting:

The SQL Server Performance Blog reports:

Pat Gelsinger did a side-by-side performance demo which launched an SSRS report, running reporting queries against a 1.5 TB SSAS OLAP cube, built using a Microsoft adCenter data set. The demo showed how Nehalem-EP is 2X faster than a Xeon 5400 on the same workload, with the same DRAM and I/O configuration. Not too shabby, but we’ve seen even faster results (~3-4X faster) on workloads which are more memory bandwidth-intensive, like data warehousing or in-memory OLAP workloads.

Intel’s Dave Hill over at the The Server Room Blog writes:

As of March 30, 2009, Intel based 2 socket Xeon® 5500 series servers set at least 30 world performance records across a wide range of benchmarks that cover virtually every application type on the market. The performance results, just by themselves, are utterly amazing, and in general they are greater than 2x the Intel® Xeon® 5400 series processors (Harpertown).

There are numerous other benchmarks listed over at the Intel® Xeon® Processor Performance summary page. Check them out. You should be nothing less than amazed. It surely is a great time to be using commodity hardware and if you are not, perhaps you should be! And for those database vendors who are using proprietary hardware like FPGAs, well, I guess you are wishing that Intel’s Nehalem-EP processors are an April Fools’ joke, but you would be wrong.

The Impact Of Good Table And Query Design

There are many ways to design tables/schemas and many ways to write SQL queries that execute against those tables/schemas. Some designs are better than others for various reasons, however, I think that frequently people underestimate the power of SQL (for both “good” and “evil”). All too often in data warehouses, I see tables designed for one specific report, or a very select few reports. These tables frequently resemble Microsoft Excel Spreadsheets (generally Pivot Tables), not good Dimensional (Star Schema) or Third Normal Form (3NF) schema design. The problem with such designs is that it severely limits the usefulness of that data, as queries that were not known at the time of design often time become problematic. The following is a simple one table example, derived from a field experience in which I discuss two table designs and provide the SQL queries to answer a question the business is seeking.

The Business Question

First lets start with the business question for which the answer is being sought.
What customers meet the following criteria:

  • do not own PRODUCT1 or PRODUCT2 but have downloaded SOFTWARE
  • do not own PRODUCT2 and it has been more than 90 days between SOFTWARE download and their purchase of PRODUCT1

Version 1: The Column Based (Pivot) Table Design

For Version 1, there is a single row for each customer and each attribute has its own column. In this case there are 4 columns, each representing the most recent activity date for that product.

SQL> desc column_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 SOFTWARE_MAC_RECENCY_TS                            DATE
 SOFTWARE_WIN_RECENCY_TS                            DATE
 PRODUCT1_RECENCY_TS                                DATE
 PRODUCT2_RECENCY_TS                                DATE

SQL> select * from column_tab;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> select customer_id
  2  from   column_tab
  3  where  product2_recency_ts is null and
  4         (((software_win_recency_ts is not null or
  5            software_mac_recency_ts is not null) and
  6           product1_recency_ts is null) or
  7          ((software_win_recency_ts - product1_recency_ts) > 90 or
  8           (software_mac_recency_ts - product1_recency_ts) > 90));

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4293700422

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |     2 |    42 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COLUMN_TAB |     2 |    42 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("PRODUCT2_RECENCY_TS" IS NULL AND ("PRODUCT1_RECENCY_TS"
              IS NULL AND ("SOFTWARE_MAC_RECENCY_TS" IS NOT NULL OR
              "SOFTWARE_WIN_RECENCY_TS" IS NOT NULL) OR
              "SOFTWARE_MAC_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90 OR
              "SOFTWARE_WIN_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90))

As you can see, the query construct to answer the business question is straight forward and requires just one pass over the table.

Version 2: The Row Based (Unpivot) Table, Take 1

In Version 2, there is a single row (tuple) which tracks the customer, product and the recency date. Unlike Version 1, none of the columns can be NULL.

SQL> desc row_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 RECENCY_TS                                NOT NULL DATE
 PRODUCT                                   NOT NULL VARCHAR2(32)

SQL> select * from row_tab;

CUSTOMER_ID RECENCY_TS PRODUCT
----------- ---------- --------------------------------
        100 2009-03-17 SOFTWARE_MAC
        200 2009-03-17 SOFTWARE_MAC
        300 2009-03-17 SOFTWARE_MAC
        500 2009-03-17 SOFTWARE_MAC
        400 2009-03-17 SOFTWARE_WIN
        100 2008-11-17 PRODUCT1
        200 2009-01-16 PRODUCT1
        300 2008-10-08 PRODUCT1
        400 2008-11-07 PRODUCT1
        300 2009-02-25 PRODUCT2

10 rows selected.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id not in (select customer_id
  9                               from   row_tab
 10                               where  product in ('PRODUCT1', 'PRODUCT2')) and
 11         a.customer_id = b.customer_id
 12  union
 13  select a.customer_id
 14  from   row_tab a,
 15         (select customer_id,
 16                 product,
 17                 recency_ts
 18          from   row_tab
 19          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
 20  where  a.customer_id not in (select customer_id
 21                               from   row_tab
 22                               where  product = 'PRODUCT2') and
 23         a.customer_id = b.customer_id and
 24         (a.product = 'PRODUCT1' and
 25          b.recency_ts - a.recency_ts > 90);

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3517586312

---------------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |    11 |   368 |    22  (60)| 00:00:01 |
|   1 |  SORT UNIQUE          |         |    11 |   368 |    22  (60)| 00:00:01 |
|   2 |   UNION-ALL           |         |       |       |            |          |
|*  3 |    HASH JOIN ANTI     |         |    10 |   310 |    10  (10)| 00:00:01 |
|*  4 |     HASH JOIN         |         |    11 |   187 |     7  (15)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL| ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|   6 |      TABLE ACCESS FULL| ROW_TAB |    10 |    30 |     3   (0)| 00:00:01 |
|*  7 |     TABLE ACCESS FULL | ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|*  8 |    HASH JOIN ANTI     |         |     1 |    58 |    10  (10)| 00:00:01 |
|*  9 |     HASH JOIN         |         |     1 |    44 |     7  (15)| 00:00:01 |
|* 10 |      TABLE ACCESS FULL| ROW_TAB |     4 |    88 |     3   (0)| 00:00:01 |
|* 11 |      TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|* 12 |     TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   4 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   5 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   7 - filter("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2')
   8 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   9 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       filter("RECENCY_TS"-"A"."RECENCY_TS">90)
  10 - filter("A"."PRODUCT"='PRODUCT1')
  11 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
  12 - filter("PRODUCT"='PRODUCT2')

Version 2, Take 2

The way the query is written in Version 2, Take 1, it requires six accesses to the table. Partly this is because it uses a UNION. In this case the UNION can be removed and replaced with an OR branch.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id = b.customer_id and
  9         ((a.customer_id not in (select customer_id
 10                               from   row_tab
 11                               where  product in ('PRODUCT1', 'PRODUCT2')))
 12         or
 13         ((a.customer_id not in (select customer_id
 14                               from   row_tab
 15                               where  product = 'PRODUCT2') and
 16         (a.product = 'PRODUCT1' and
 17          b.recency_ts - a.recency_ts > 90))))
 18  /

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3327813549

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    44 |     7  (15)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|*  2 |   HASH JOIN         |         |    11 |   484 |     7  (15)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
|*  6 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2'))
              OR  NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B2) AND
              "A"."PRODUCT"='PRODUCT1' AND "RECENCY_TS"-"A"."RECENCY_TS">90)
   2 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   3 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   5 - filter("CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR
              "PRODUCT"='PRODUCT2'))
   6 - filter("PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B1)

This rewrite brings the table accesses down to four from six, so progress is being made, but I think we can do even better.

Version 2, Take 3

SQL is a very powerful language and there is usually more than one way to structure a query. Version 2, Take 1 uses a very literal translation of the business question and Take 2 just does a mild rewrite changing the UNION to an OR. In Version 2, Take 3, I am going to leverage some different, but very powerful functionality to yield the same results.

SQL> -- COLUMN_TAB can be expressed using ROW_TAB with MAX + CASE WHEN + GROUP BY:
SQL> select   customer_id,
  2           max (case
  3                   when product = 'SOFTWARE_MAC'
  4                      then recency_ts
  5                end) software_mac_recency_ts,
  6           max (case
  7                   when product = 'SOFTWARE_WIN'
  8                      then recency_ts
  9                end) software_win_recency_ts,
 10           max (case
 11                   when product = 'PRODUCT1'
 12                      then recency_ts
 13                end) product1_recency_ts,
 14           max (case
 15                   when product = 'PRODUCT2'
 16                      then recency_ts
 17                end) product2_recency_ts
 18  from     row_tab
 19  group by customer_id;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- The original query can be expressed as follows:
SQL> select customer_id
  2  from   (select   customer_id,
  3                   max (case
  4                           when product = 'SOFTWARE_MAC'
  5                              then recency_ts
  6                        end) software_mac_recency_ts,
  7                   max (case
  8                           when product = 'SOFTWARE_WIN'
  9                              then recency_ts
 10                        end) software_win_recency_ts,
 11                   max (case
 12                           when product = 'PRODUCT1'
 13                              then recency_ts
 14                        end) product1_recency_ts,
 15                   max (case
 16                           when product = 'PRODUCT2'
 17                              then recency_ts
 18                        end) product2_recency_ts
 19          from     row_tab
 20          group by customer_id)
 21  where  product2_recency_ts is null and
 22         (((software_win_recency_ts is not null or
 23            software_mac_recency_ts is not null) and
 24           product1_recency_ts is null) or
 25          ((software_win_recency_ts - product1_recency_ts) > 90 or
 26           (software_mac_recency_ts - product1_recency_ts) > 90)
 27         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 825621652

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|   2 |   HASH GROUP BY     |         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE "PRODUCT" WHEN 'PRODUCT2' THEN "RECENCY_TS" END
              ) IS NULL AND ((MAX(CASE "PRODUCT" WHEN 'SOFTWARE_WIN' THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE "PRODUCT" WHEN
              'SOFTWARE_MAC' THEN "RECENCY_TS" END ) IS NOT NULL) AND MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END ) IS NULL OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_WIN' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90 OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_MAC' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90))

Rewriting the query as a CASE WHEN with a GROUP BY not only cleaned up the SQL, it also resulted in a single pass over the table. Version 2, Take 3 reduces the table access from four to one!

Version 2, Take 4: The PIVOT operator in 11g

In 11g the PIVOT operator was introduced and can simplify the query even more.

SQL> -- In 11g the PIVOT operator can be used, so COLUMN_TAB can be expressed as:
SQL> select *
  2  from row_tab
  3  pivot (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts));

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- Using PIVOT the original query can be expressed as:
SQL> select customer_id
  2  from   row_tab
  3  pivot  (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts))
  8  where  product2_recency_ts is null and
  9         (((software_win_recency_ts is not null or
 10            software_mac_recency_ts is not null) and
 11           product1_recency_ts is null) or
 12          ((software_win_recency_ts - product1_recency_ts) > 90 or
 13           (software_mac_recency_ts - product1_recency_ts) > 90)
 14         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3127820873

--------------------------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER              |         |       |       |            |          |
|   2 |   HASH GROUP BY PIVOT|         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL | ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE  WHEN ("PRODUCT"='PRODUCT2') THEN "RECENCY_TS"
              END ) IS NULL AND ((MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE  WHEN
              ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END ) IS NOT NULL) AND
              MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END ) IS NULL
              OR MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90 OR
              MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90))

The Big Picture

One thing that I did not touch on is the flexibility of the ROW_TAB design when it comes to evolution. Any number of products can be added without making any modifications to the loading process. In order to do this with the COLUMN_TAB a new column must be added for each new product. The other major difference between the two table designs is that ROW_TAB is insert only while COLUMN_TAB must be updated if the customer exists. Generally one wants to avoid updated in a data warehouse as 1) old data is usually over written and 2) updates are more expensive than inserts.

The other major thing I won’t discuss in detail is how to partition or index (if required) COLUMN_TAB. Think about this. With ROW_TAB it is very straight forward.

Summary

There are many ways to design tables and write queries. Some of them work well, some do not. Some appear impossible at first, only to appear more simple later. Literal translation of a business question into SQL is usually far from optimal. One needs to think about the question being asked, the shape of the data, and the options available to solve that problem as well as the trade offs of those solutions. Remember: table definitions do not have to look like Spreadsheets. Generally only the output of a query needs to.

Don’t get stuck in SQL-92. It is the year 2009. You should be writing your SQL using the constructs that are provided. Often times very complex data transformations can be done with just SQL. Leverage this power.

All experiments performed on 11.1.0.7

The Best Benchmarketing I've Seen Yet: Measure BI Queries In Milliseconds

After posting about how ridiculous some of the benchmarketing claims that database vendors are making, Dave Menninger, VP of Marketing & Product Management at Vertica posted a comment that one of their customers reported a 40,400x gain in one query (this of course is after I openly joked about the 16,200x Vertica claim). So I made my way over to check out this claim, and sure enough, someone reported this. Here is the table presented in the webcast:

hMetrix_Vertica.png

To this database performance engineer, this yet another unimpressive performance claim, but rather a very creative use of numbers, or maybe better put, a good case of bad math. Or better yet, big fun with small numbers. Honestly, measuring a BI query response time in milliseconds?!?! I don’t even know if OLTP database users measure their query response time in milliseconds. I simply can’t stop laughing at the fact that there needs to be precision below 1 second. Obviously BI users could not possibly tell that their query ran in less than 1 second because the network latency would mask this. Not only that, it seems there were 154 queries to choose from and the Vertica marketing crew chose to mention this one. Brilliant I say. So yes Dave, this is even more ludicrous than the 16,200x claim. At best it is a 202x gain. You won’t get credit from me (and probably others) for fractional seconds, but thanks for mentioning it. It was a good chuckle. By the way, why add two extra places of precision for this query and not all the others?

I think it is also worth mentioning that the data set size for this case is 84GB (raw) and 10.5GB in the Vertica DB (8x compression). Given the server running the database has 32GB of RAM it easily classifies as an in-memory database, so response time should certainly be in the seconds. I don’t know about you, but performance claims on a database in which the uncompressed data fits on an Apple iPod don’t excite me.

Dave Menninger also mentions:

One other piece of information in an effort of full (or at least more) disclosure is the following blog post that breaks down the orders of magnitude differences between row stores and column stores to their constituent parts.
Debunking Yet Another Myth: Column-Stores As A Storage-Layer Only Optimization

Column stores have been a topic of many research papers. The one that has caught my attention most recently is the paper by Allison Holloway and David DeWitt (Go Badgers!) entitled Read-Optimized Databases, In Depth and the VLDB 2008 presentation which has an alternate title of Yet Another Row Store vs Column Store Paper. I might suggest that you give them a read. Perhaps the crew at The Database Column will offer some comments on Allison and David’s research. I’m surprised that they haven’t already.

Well, that’s enough fun for a Friday. Time to kick off some benchmark queries on my HP Oracle Database Machine.

Database Customer Benchmarketing Reports

A few weeks ago I read Curt Monash’s report on interpreting the results of data warehouse proofs-of-concept (POCs) and I have to say, I’m quite surprised that this topic hasn’t been covered more by analysts in the data warehousing space. I understand that analysts are not database performance engineers, but where do they think that the performance claims of 10x to 100x or more come from? Do they actually investigate these claims or just report on them? I can not say that I have ever seen any database analyst offer any technical insight into these boasts of performance. If some exist be sure to leave a comment and point me to them.

Oracle Exadata Performance Architect Kevin Closson has blogged about a 485x performance increase of Oracle Exadata vs. Oracle Exadata and his follow-up post to explain exactly where the 485x performance gain comes from gave me the nudge to finish this post that had been sitting in my drafts folder since I first read Curt’s post.

Customer Bechmarketing Claims

I thought I would compile a list of what the marketing folks at other database vendors are saying about the performance of their products. Each of these statements have been taken from the given vendor’s website.

  • Netezza: 10-100 times faster than traditional solutions…but it is not uncommon to see performance differences as large as 200x to even 400x or more when compared to existing Oracle systems
  • Greenplum: often 10 to 100 times faster than traditional solutions
  • DATAllegro: 10-100x performance over traditional platforms
  • Vertica: Performs 30x-200x faster than other solutions
  • ParAccel: 20X – 200X performance gains
  • EXASolution: can perform up to 100 times faster than with traditional databases
  • Kognitio WX2: Tests have shown to out-perform other database / data warehouse solutions by 10-60 times

Certainly seems these vendors are a positioning themselves against traditional database solutions, whatever that means. And differences as large as 400x against Oracle? What is it exactly they are comparing?

Investigative Research On Netezza’s Performance Claims

Using my favorite Internet search engine I came across this presentation by Netezza dated October 2007. On slide 21 Netezza is comparing an NPS 8150 (112 SPU, up to 4.5 TB of user data) server to IBM DB2 UDB on a p680 with 12 CPUs (the existing solution). Not being extremely familiar with the IBM hardware mentioned, I thought I’d research to see exactly what an IBM p680 server consists of. The first link in my search results took me to here where the web page states:

The IBM eServer pSeries 680 has been withdrawn from the market, effective March 28, 2003.

Searching a bit more I came across this page which states that the 12 CPUs in the pSeries 680 are RS64 IV microprocessors. According to Wikipedia the “RS64-IV or Sstar was introduced in 2000 at 600 MHz, later increased to 750 MHz”. Given that at best, the p680 had 12 CPUs running at 750 MHz and the NPS 8150 had 112 440GX PowerPC processors I would give the compute advantage to Netezza by a significant margin. I guess it is cool to brag how your most current hardware beat up on some old used and abused server who has already been served its end-of-life notice. I found it especially intriguing that Netezza is boasting about beating out an IBM p680 server that has been end-of-lifed more than four years prior to the presentation’s date. Perhaps they don’t have any more recent bragging to do?

Going back one slide to #20 you will notice a comparison of Netezza and Oracle. Netezza clearly states they used a NPS 8250 (224 SPUs, up to 9 TB of user data) against Oracle 10g RAC running on Sun/EMC. Well ok…Sun/EMC what??? Obviously there were at least 2 Sun servers, since Oracle 10g RAC is involved, but they don’t mention the server models at all, nor the storage, nor the storage connectivity to the hosts. Was this two or more Sun Netra X1s or what??? Netezza boasts a 449x improvement in a “direct comparison on one day’s worth of data”. What exactly is being compared is up to the imagination. I guess this could be one query or many queries, but the marketeers intentionally fail to mention. They don’t even mention the data set size being compared. Given that Netezza can read data off the 224 drives at 60-70 MB/s, the NPS 8250 has a total scan rate of over 13 GB/s. I can tell you first hand that there are very few Sun/EMC solutions that are configured to support 13 GB/s of I/O bandwidth. Most configurations of that vintage probably don’t support 1/10th of that I/O bandwidth (1.3 GB/s).

Here are a few more comparisons that I have seen in Netezza presentations:

  • NPS 8100 (112 SPUs/4.5 TB max) vs. SAS on Sun E5500/6 CPUs/6GB RAM
  • NPS 8100 (112 SPUs/4.5 TB max) vs. Oracle 8i on Sun E6500/12 CPUs/8 GB RAM
  • NPS 8400 (448 SPUs/18 TB max) vs. Oracle on Sun (exact hardware not mentioned)
  • NPS 8100 (112 SPUs/4.5 TB max) vs. IBM SP2 (database not mentioned)
  • NPS 8150z (112 SPUs/5.5 TB max) vs. Oracle 9i on Sun/8 CPUs
  • NPS 8250z (224 SPUs/11 TB max) vs. Oracle 9i on Sun/8 CPUs

As you can see, Netezza has a way of finding the oldest hardware around and then comparing it to its latest, greatest NPS. Just like Netezza slogan, [The Power to ]Question Everything™, I suggest you question these benchmarketing reports. Database software is only as capable as the hardware it runs on and when Netezza targets the worst performing and oldest systems out there, they are bound to get some good marketing numbers. If they compete against the latest, greatest database software running on the latest, greatest hardware, sized competitively for the NPS being used, the results are drastically different. I can vouch for that one first hand having done several POCs against Netezza.

One Benchmarketing Claim To Rule Them All

Now, one of my favorite benchmarketing reports is one from Vertica. Michael Stonebraker’s blog post on customer benchmarks contains the following table:

vertica_benchmark_table.png

Take a good look at the Query 2 results. Vertica takes a query running in the current row store from running in 4.5 hours (16,200 seconds) to 1 second for a performance gain of 16,200x. Great googly moogly batman, that is reaching ludicrous speed. Heck, who needs 100x or 400x when you do 16,200x. That surely warrants an explanation of the techniques involved there. It’s much, much more than simply column store vs. row store. It does raise the question (at least to me): why Vertica doesn’t run every query in 1 second. I mean, come on, why doesn’t that 19 minute row store query score better than a 30x gain? Obviously there is a bit of the magic pixie dust going on here with, what I would refer to as “creative solutions” (in reality it is likely just a very well designed projection/materaizied view, but by showing the query and telling us how it was possible would make it less unimpressive [sic]).

What Is Really Going On Here

First of all, you will notice that not one of these benchmarketing claims is against a vendor run system. Each and every one of these claims are against existing customer systems. The main reason for this is that most vendors prohibit benchmark results being published with out prior consent from the vendor in the licensing agreement. Seems the creative types have found that taking the numbers from the existing, production system is not prohibited in the license agreement so they compare that to their latest, greatest hardware/software and execute or supervise the execution of a benchmark on their solution. Obviously this is a one sided apples to bicycles comparison, but quite favorable for bragging rights for the new guy.

I’ve been doing customer benchmarks and proof of concepts (POCs) for almost 5 years at Oracle. I can guarantee you that Netezza has never even come close to getting 10x-100x the performance over Oracle running on a competitive hardware platform. Now I can say that it is not uncommon for Oracle running on a balanced system to perform 10x to 1000x (ok, in extreme cases) over an existing poorly performing Oracle system. All it takes is to have a very unbalanced system with no I/O bandwidth, not be using parallel query, not use compression, poor or no use of partitioning and you have created a springboard for any vendor to look good.

One More Juicy Marketing Tidbit

While searching the Internet for creative marketing reports I have to admit that the crew at ParAccel probably takes the cake (and not in an impressive way). On one of their web pages they have these bullet points (plus a few more uninteresting ones):

  • All operations are done in parallel (A non-parallel DBMS must scan all of the data sequentially)
  • Adaptive compression makes disks faster…

Ok, so I can kinda, sorta see the point that a non-parallel DBMS must do something sequentially…not sure how else it would do it, but then again, I don’t know any enterprise database that is not capable of parallel operations. However, I’m going to need a bit of help on the second point there…how exactly does compression make disks faster? Disks are disks. Whether or not compression is involved has nothing to do with how fast a disk is. Perhaps they mean that compression can increase the logical read rate from a disk given that compression allows more data to be stored in the same “space” on the disk, but that clearly is not what they have written. Reminds me of DATAllegro’s faster-than-wirespeed claims on scan performance. Perhaps these marketing guys should have their numbers and wording validated by some engineers.

Do You Believe In Magic Or Word Games?

Creditable performance claims need to be accounted for and explained. Neil Raden from Hired Brains Research offers guidance for evaluating benchmarks and interpreting market messaging in his paper, Questions to Ask a Data Warehouse Appliance Vendor. I think Neil shares the same opinion of these silly benchmarketing claims. Give his paper a read.

DBMS_STATS, METHOD_OPT and FOR ALL INDEXED COLUMNS

I’ve written before on choosing an optimal stats gathering strategy but I recently came across a scenario that I didn’t directly blog about and think it deserves attention. As I mentioned in that previous post, one should only deviate from the defaults when they have a reason to, and fully understand that reason and the effect of that decision.

Understanding METHOD_OPT

The METHOD_OPT parameter of DBMS_STATS controls two things:

  1. on which columns statistics will be collected
  2. on which columns histograms will be collected (and how many buckets)

It is very important to understand #1 and how the choice of METHOD_OPT effects the collection of column statistics.

Prerequisite: Where Do I Find Column Statistics?

Understanding where to find column statistics is vital for troubleshooting bad execution plans. These views will be the arrows in your quiver:

  • USER_TAB_COL_STATISTICS
  • USER_PART_COL_STATISTICS
  • USER_SUBPART_COL_STATISTICS

Depending on if the table is partitioned or subpartitioned, and depending on what GRANULARITY the stats were gathered with, the latter two of those views may or may not be populated.

The Bane of METHOD_OPT: FOR ALL INDEXED COLUMNS

If you are using FOR ALL INDEXED COLUMNS as part of your METHOD_OPT you probably should not be. Allow me to explain. Using MENTOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE AUTO' (a common METHOD_OPT I see) tells DBMS_STATS: “only gather stats on columns that participate in an index and based on data distribution and the workload of those indexed columns decide if a histogram should be created and how many buckets it should contain“. Is that really what you want? My guess is probably not. Let me work through a few examples to explain why.

I’m going to start with this table.

SQL> exec dbms_random.initialize(1);

PL/SQL procedure successfully completed.

SQL> create table t1
  2  as
  3  select
  4    column_value                    pk,
  5    round(dbms_random.value(1,2))   a,
  6    round(dbms_random.value(1,5))   b,
  7    round(dbms_random.value(1,10))  c,
  8    round(dbms_random.value(1,100)) d,
  9    round(dbms_random.value(1,100)) e
 10  from table(counter(1,1000000))
 11  /

Table created.

SQL> begin
  2    dbms_stats.gather_table_stats(
  3      ownname => user ,
  4      tabname => 'T1' ,
  5      estimate_percent => 100 ,
  6      cascade => true);
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> select
  2    COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  3    to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  4  from user_tab_col_statistics
  5  where table_name='T1'
  6  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:39:51
A                      2 NONE                      1 2008-13-10 18:39:51
B                      5 NONE                      1 2008-13-10 18:39:51
C                     10 NONE                      1 2008-13-10 18:39:51
D                    100 NONE                      1 2008-13-10 18:39:51
E                    100 NONE                      1 2008-13-10 18:39:51

6 rows selected.

This 6 column table contains 1,000,000 rows of randomly generated numbers. I’ve queried USER_TAB_COL_STATISTICS to display some of the important attributes (NDV, Histogram, Number of Buckets, etc).

I’m going to now put an index on T1(PK), delete the stats and recollect stats using two different METHOD_OPT parameters that each use 'FOR ALL INDEXED COLUMNS'.

SQL> create unique index PK_T1 on T1(PK);

Index created.

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt => 'for all indexed columns' ,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 HEIGHT BALANCED          75 2008-13-10 18:41:10

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt => 'for all indexed columns size auto' ,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:41:12

Notice that in both cases only column PK has stats on it. Columns A,B,C,D and E do not have any stats collected on them. Also note that when no SIZE clause is specified, it defaults to 75 buckets.

Now one might think that is no big deal or perhaps they do not realize this is happening because they do not look at their stats. Let’s see what we get for cardinality estimates from the Optimizer for a few scenarios.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.24 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  10000 |    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)

Notice the E-Rows estimate for T1. The Optimizer is estimating 10,000 rows when in reality there is 500,227. The estimate is off by more than an order of magnitude (50x). Normally the calculation for the cardinality would be (for a one table single equality predicate):
number of rows in T1 * 1/NDV = 1,000,000 * 1/2 = 500,000
but in this case 10,000 is the estimate. Strangely enough (or not), 10,000 is exactly 0.01 (1%) of 1,000,000. Because there are no column stats for T1.A, the Optimizer is forced to make a guess, and that guess is 1%.

As you can see from the 10053 trace (below), since there are no statistics on the column, defaults are used. In this case they yield very poor cardinality estimations.

SINGLE TABLE ACCESS PATH
  -----------------------------------------
  BEGIN Single Table Cardinality Estimation
  -----------------------------------------
  Column (#2): A(NUMBER)  NO STATISTICS (using defaults)
    AvgLen: 13.00 NDV: 31250 Nulls: 0 Density: 3.2000e-05
  Table: T1  Alias: T1
    Card: Original: 1000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  -----------------------------------------
  END   Single Table Cardinality Estimation
  -----------------------------------------

Now that I’ve demonstrated how poor the cardinality estimation was with a single equality predicate, let’s see what two equality predicates gives us for a cardinality estimate.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.19 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    100 |    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3))

Yikes. In this case the cardinality estimate is 100 when the actual number of rows is 124,724, a difference of over 3 orders of magnitude (over 1000x). Where did the 100 row estimate come from? In this case there are two equality predicates so the selectivity is calculated as 1% * 1% or 0.01 * 0.01 = 0.0001. 1,000,000 * 0.0001 = 100. Funny that. (The 1% is the default selectivity for an equality predicate w/o stats.)

Now let’s add a derived predicate as well and check the estimates.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.22 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |      5 |    109K|00:00:00.33 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3 AND "D"+"E">50))

Doh! The cardinality estimate is now 5, but the actual number of rows being returned is 109,816. Not good at all. The Optimizer estimated 5 rows because it used a default selectivity of 1% (for A=1) * 1% (for B=3) * 5% (for D+E > 50) * 1,000,000 rows. Now can you see why column statistics are very important? All it takes is a few predicates and the cardinality estimation becomes very small, very fast. Now consider this:

  • What is likely to happen in a data warehouse where the queries are 5+ table joins and the fact table columns do not have indexes?
  • Would the Optimizer choose the correct driving table?
  • Would nested loops plans probably be chosen when it is really not appropriate?

Hopefully you can see where this is going. If you don’t, here is the all too common chain of events:

  • Non representative (or missing) statistics lead to
  • Poor cardinality estimates which leads to
  • Poor access path selection which leads to
  • Poor join method selection which leads to
  • Poor join order selection which leads to
  • Poor SQL execution times

Take 2: Using the Defaults

Now I’m going to recollect stats with a default METHOD_OPT and run through the 3 execution plans again:

SQL> begin
  2    dbms_stats.delete_table_stats(user,'t1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      degree => 8,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, histogram, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics where table_name='T1'
  4  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 19:44:32
A                      2 FREQUENCY                 2 2008-13-10 19:44:32
B                      5 FREQUENCY                 5 2008-13-10 19:44:32
C                     10 FREQUENCY                10 2008-13-10 19:44:32
D                    100 NONE                      1 2008-13-10 19:44:32
E                    100 NONE                      1 2008-13-10 19:44:32

6 rows selected.
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.20 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    500K|    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.14 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    124K|    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1))
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e>50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |   6236 |    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

As you can see, the first two queries have spot on cardinality estimates, but the the third query isn’t as good as it uses a column combination and there are no stats on D+E columns, only D and E individually. I’m going to rerun the third query with dynamic sampling set to 4 (in 10g it defaults to 2) and reevaluate the cardinality estimate.

SQL> alter session set optimizer_dynamic_sampling=4;

Session altered.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 1
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    102K|    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

Note
-----
   - dynamic sampling used for this statement

Bingo! Close enough to call statistically equivalent.

Summary

I hope this little exercise demonstrates how important it is to have representative statistics and that when statistics are representative the Optimizer can very often accurately estimate the cardinality and thus choose the best plan for the query. Remember these points:

  • Recent statistics do not necessarily equate to representative statistics.
  • Statistics are required on all columns to yield good plans – not just indexed columns.
  • You probably should not be using METHOD_OPT => 'FOR ALL INDEXED COLUMNS SIZE AUTO', especially in a data warehouse where indexes are used sparingly.
  • Dynamic Sampling can assist with cardinality estimates where existing stats are not enough.

Tests performed on 10.2.0.4

Automatic DB_FILE_MULTIBLOCK_READ_COUNT

Note: Originally this experiment was from a post I wrote on the Oracle Forum: Database – General. I recommend that you read Jonathan Lewis’ summarization of the thread instead of reading all 671 posts (as of today). You will spend much less time and get more out of the discussion.

One of the new features that was released in 10gR2 is the automatic DB_FILE_MULTIBLOCK_READ_COUNT. Below are portions from the documentation that describe this feature.

Oracle Database 10g New Features

The DB_FILE_MULTIBLOCK_READ_COUNT parameter controls the amount of block prefetching done in the buffer cache during scan operations, such as full table scan and index fast full scan. The value of this parameter can have a significant impact on the overall database performance. This feature enables Oracle Database to automatically select the appropriate value for this parameter depending on the operating system optimal I/O size and the size of the buffer cache.

This feature simplifies manageability by automating the tuning of DB_FILE_MULTIBLOCK_READ_COUNT initialization parameter.

Oracle Database Performance Tuning Guide

This parameter specifies the number of blocks that are read in a single I/O during a full table scan or index fast full scan. The optimizer uses the value of DB_FILE_MULTIBLOCK_READ_COUNT to cost full table scans and index fast full scans. Larger values result in a cheaper cost for full table scans and can result in the optimizer choosing a full table scan over an index scan. If this parameter is not set explicitly (or is set is 0), the optimizer will use a default value of 8 when costing full table scans and index fast full scans.

Be Aware of the Bug

Although the documentation states:

If this value is not set explicitly (or is set to 0)…

there is a bug (5768025) if one sets DB_FILE_MULTIBLOCK_READ_COUNT to 0. This will result in making all muti-block I/O requests 1 block (db file sequential read), thus completely disabling the advantage of DB_FILE_MULTIBLOCK_READ_COUNT. Be aware!!! My recommendation: just don’t set it if you want to enable it.

Read I/O Request Size

Currently, the maximum read I/O request size that Oracle can issue to the OS is 1 Megabyte (1MB). The equation for the maximum read I/O request size from the Oracle database is db_file_multiblock_read_count * db_block_size. For example, if you are using a db_block_size of 8192 (8k) and db_file_multiblock_read_count is set to 64 the maximum read size request would be 8192 * 64 = 524,288 bytes or 0.5MB. One could set db_file_multiblock_read_count = 128 to achieve a 1MB read size, but that is the absolute maximum possible.

The advantage of using the automatic DB_FILE_MULTIBLOCK_READ_COUNT is that the database can leverage the benefits of a large read I/O request size without over influencing the cost based optimizer toward full table scans.

The Experiment of Block Size and Automatic DB_FILE_MULTIBLOCK_READ_COUNT

The purpose of this experiment will be to provide metrics so we can answer the question:
Does block size have any impact on elapsed time for a FTS query with 100% physical I/Os when using the automatic DB_FILE_MULTIBLOCK_READ_COUNT?

The experiment:

  • 4 identical tables, with block sizes of 2k, 4k, 8k and 16k
  • DB_FILE_MULTIBLOCK_READ_COUNT will be unset, letting the Oracle database choose the best size
  • cold db cache so forcing 100% physical reads
  • ASM storage, so no file system cache
  • query will be: select * from table;

For the data in the table I’m going to use the WEB_RETURNS (SF=100GB) table from TPC-DS. The flat file is 1053529104 bytes (~1GB) as reported from the ls command.

-- tablespace create statements
create tablespace tpcds_8k  datafile '+GROUP1' size 1500m;
create tablespace tpcds_2k  datafile '+GROUP1' size 1500m blocksize 2k;
create tablespace tpcds_4k  datafile '+GROUP1' size 1500m blocksize 4k;
create tablespace tpcds_16k datafile '+GROUP1' size 1500m blocksize 16k;

-- table create statements
create table web_returns_8k  tablespace tpcds_8k  as select * from web_returns_et;
create table web_returns_2k  tablespace tpcds_2k  as select * from web_returns_et;
create table web_returns_4k  tablespace tpcds_4k  as select * from web_returns_et;
create table web_returns_16k tablespace tpcds_16k as select * from web_returns_et;

-- segment size
select segment_name, sum(bytes)/1024/1024 mb from user_segments group by segment_name;

SEGMENT_NAME                 MB
-------------------- ----------
WEB_RETURNS_2K              976
WEB_RETURNS_4K              920
WEB_RETURNS_8K              896
WEB_RETURNS_16K             880

SQL> desc WEB_RETURNS_16K
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 WR_RETURNED_DATE_SK                                NUMBER(38)
 WR_RETURNED_TIME_SK                                NUMBER(38)
 WR_ITEM_SK                                         NUMBER(38)
 WR_REFUNDED_CUSTOMER_SK                            NUMBER(38)
 WR_REFUNDED_CDEMO_SK                               NUMBER(38)
 WR_REFUNDED_HDEMO_SK                               NUMBER(38)
 WR_REFUNDED_ADDR_SK                                NUMBER(38)
 WR_RETURNING_CUSTOMER_SK                           NUMBER(38)
 WR_RETURNING_CDEMO_SK                              NUMBER(38)
 WR_RETURNING_HDEMO_SK                              NUMBER(38)
 WR_RETURNING_ADDR_SK                               NUMBER(38)
 WR_WEB_PAGE_SK                                     NUMBER(38)
 WR_REASON_SK                                       NUMBER(38)
 WR_ORDER_NUMBER                                    NUMBER(38)
 WR_RETURN_QUANTITY                                 NUMBER(38)
 WR_RETURN_AMT                                      NUMBER(7,2)
 WR_RETURN_TAX                                      NUMBER(7,2)
 WR_RETURN_AMT_INC_TAX                              NUMBER(7,2)
 WR_FEE                                             NUMBER(7,2)
 WR_RETURN_SHIP_COST                                NUMBER(7,2)
 WR_REFUNDED_CASH                                   NUMBER(7,2)
 WR_REVERSED_CHARGE                                 NUMBER(7,2)
 WR_ACCOUNT_CREDIT                                  NUMBER(7,2)
 WR_NET_LOSS                                        NUMBER(7,2)

I’m using a Pro*C program to execute each query and fetch the rows with an array size of 100. This way I don’t have to worry about spool space, or overhead of SQL*Plus formatting. I have 4 files that contain the queries for each of the 4 tables for each of the 4 block sizes.

Output from a run is such:

BEGIN_TIMESTAMP   QUERY_FILE                       ELAPSED_SECONDS ROW_COUNT
----------------- -------------------------------- --------------- ---------
20080604 22:22:19 2.sql                                 125.696083   7197670
20080604 22:24:25 4.sql                                 125.439680   7197670
20080604 22:26:30 8.sql                                 125.502804   7197670
20080604 22:28:36 16.sql                                125.251398   7197670

As you can see, no matter what the block size, the execution time is the same (discounting fractions of a second).

The TKPROF Output

Below is the TKPROF output from each of the 4 executions.

TKPROF: Release 11.1.0.6.0 - Production on Wed Jun 4 22:35:07 2008

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

Trace file: v11_ora_12162.trc
Sort options: default

********************************************************************************
count    = number of times OCI procedure was executed
cpu      = cpu time in seconds executing
elapsed  = elapsed time in seconds executing
disk     = number of physical reads of buffers from disk
query    = number of buffers gotten for consistent read
current  = number of buffers gotten in current mode (usually for update)
rows     = number of rows processed by the fetch or execute call
********************************************************************************

/* 2.sql */

select * from web_returns_2k



call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    71978     25.39      26.42     493333     560355          0     7197670
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    71980     25.39      26.42     493333     560355          0     7197670

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 50

Rows     Row Source Operation
-------  ---------------------------------------------------
7197670  TABLE ACCESS FULL WEB_RETURNS_2K (cr=560355 pr=493333 pw=493333 time=88067 us cost=96149 size=770150690 card=7197670)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   71980        0.00          0.16
  SQL*Net message from client                 71980        0.00         93.20
  db file sequential read                         3        0.00          0.01
  direct path read                             1097        0.04          0.13
  SQL*Net more data to client                 71976        0.00          1.88
********************************************************************************

/* 4.sql */
select * from web_returns_4k

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute      2      0.00       0.03          0          0          0           0
Fetch    71978     24.98      25.92     232603     302309          0     7197670
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    71982     24.98      25.96     232603     302309          0     7197670

Misses in library cache during parse: 0
Parsing user id: 50

Rows     Row Source Operation
-------  ---------------------------------------------------
7197670  TABLE ACCESS FULL WEB_RETURNS_4K (cr=302309 pr=232603 pw=232603 time=84876 us cost=51644 size=770150690 card=7197670)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   71981        0.00          0.15
  SQL*Net message from client                 71981        0.00         93.19
  db file sequential read                         2        0.00          0.01
  direct path read                             1034        0.02          0.19
  SQL*Net more data to client                 71976        0.00          1.85
  rdbms ipc reply                                 1        0.03          0.03
********************************************************************************

/* 8.sql */
select * from web_returns_8k

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute      2      0.00       0.01          0          0          0           0
Fetch    71978     24.61      25.71     113157     183974          0     7197670
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    71982     24.61      25.73     113157     183974          0     7197670

Misses in library cache during parse: 0
Parsing user id: 50

Rows     Row Source Operation
-------  ---------------------------------------------------
7197670  TABLE ACCESS FULL WEB_RETURNS_8K (cr=183974 pr=113157 pw=113157 time=85549 us cost=31263 size=770150690 card=7197670)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   71981        0.00          0.15
  SQL*Net message from client                 71981        0.00         93.32
  db file sequential read                         1        0.01          0.01
  direct path read                              999        0.01          0.17
  SQL*Net more data to client                 71976        0.00          1.83
  rdbms ipc reply                                 1        0.01          0.01
********************************************************************************

/* 16.sql */
select * from web_returns_16k

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    71978     24.74      25.59      55822     127217          0     7197670
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    71980     24.74      25.59      55822     127217          0     7197670

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 50

Rows     Row Source Operation
-------  ---------------------------------------------------
7197670  TABLE ACCESS FULL WEB_RETURNS_16K (cr=127217 pr=55822 pw=55822 time=82996 us cost=21480 size=770150690 card=7197670)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   71980        0.00          0.15
  SQL*Net message from client                 71980        0.00         93.39
  db file sequential read                         1        0.00          0.00
  direct path read                              981        0.01          0.16
  SQL*Net more data to client                 71976        0.00          1.84
********************************************************************************

Raw Trace File Metrics

select FILE_ID,TABLESPACE_NAME from dba_data_files where TABLESPACE_NAME like 'TPC%'

   FILE_ID TABLESPACE_NAME
---------- ---------------
	16 TPCDS_8K
	17 TPCDS_2K
	18 TPCDS_4K
	19 TPCDS_16K

2k: WAIT #2: nam='direct path read' ela= 37 file number=17 first dba=33280 block cnt=512 obj#=55839 tim=1212643347820647
4k: WAIT #2: nam='direct path read' ela= 33 file number=18 first dba=16640 block cnt=256 obj#=55840 tim=1212643474070675
8k: WAIT #1: nam='direct path read' ela= 30 file number=16 first dba=8320  block cnt=128 obj#=55838 tim=1212643599631927
16k:WAIT #2: nam='direct path read' ela= 39 file number=19 first dba=55040 block cnt=64  obj#=55841 tim=1212643838893785

The raw trace file shows us that for each block size the reads are optimized to 1MB. For example, with a 2k block, 512 blocks are read at a time. The cnt= is the number of blocks read with a single multi-block read.

Block Size
MBRC
I/O Size
2,048
512
1MB
4,096
256
1MB
8,192
128
1MB
16,384
64
1MB

So What Does This Experiment Demonstrate?

When using the automatic DB_FILE_MULTIBLOCK_READ_COUNT, it actually is not the blocksize that really matters, but the I/O request size. More importantly, the Oracle database can decide the optimal MBRC no matter what the blocksize, demonstrating there is no advantage to a larger (or even smaller) blocksize in this case.

Think of it like this: If I grab $100 from a bucket of coins given these rules:

  • with each grab, exactly $1 is retrieved
  • the same denomination of coin is always retrieved for a given “run”
  • the time to complete the task is only related to the number of grabs, not the number of coins obtained

Regardless of the denomination of the coins grabbed, I need to grab 100 times. I could grab 4 quarters, or 10 dimes or 20 nickels or 100 pennies and each grab “performs” the same.

Oracle 11g: Incremental Global Statistics On Partitioned Tables

Previously I blogged about the new and improved DBMS_STATS.AUTO_SAMPLE_SIZE used to calculate NDV in Oracle 11g and now I wanted to touch on another new feature of DBMS_STATS in 11g: Incremental Global Statistics On Partitioned Tables.

Before Incremental Global Stats (Two-Pass Method)

When DBMS_STATS.GATHER_TABLE_STATS collects statistics on a partitioned table, generally it does so at the partition and table (global) level (the default behavior can be modified by changing the GRANULARITY parameter). This is done in two steps. First, partition level stats are gathered by scanning the partition(s) that have stale or empty stats, then a full table scan is executed to gather the global statistics. As more partitions are added to a given table, the longer the execution time for GATHER_TABLE_STATS, due to the full table scan requited for global stats.

Using Incremental Global Stats (Synopsis-Based Method)

Incremental Global Stats works by collecting stats on partitions and storing a synopsis which is the statistics metadata for that partition and the columns for that partition. This synopsis is stored in the SYSAUX tablespace, but is quite small (only a few kilobytes). Global stats are then created not by reading the entire table, but by aggregating the synopses from each partition. Incremental Global Stats, in conjunction with the new 11g DBMS_STATS.AUTO_SAMPLE_SIZE, yield a significant reduction in the time to collect statistics and produce near perfect accuracy.

Turning On Incremental Global Stats

Incremental Global Stats can only be used for partitioned tables and is activated by this command:

SQL> exec DBMS_STATS.SET_TABLE_PREFS(user,'FOO','INCREMENTAL','TRUE')

-- To see the value for INCREMENTAL for a given table:

SQL> select dbms_stats.get_prefs('INCREMENTAL', tabname=>'FOO') from dual;

DBMS_STATS.GET_PREFS('INCREMENTAL',TABNAME=>'FOO')
--------------------------------------------------
TRUE

You may also use any of the other DBMS_STATS.SET_*_PREFS as well.

A Real-World Example

To demonstrate the benefit of Incremental Global Statistics, I created a range partitioned table consisting of 60 range partitions. The target table starts empty and one million (1,000,000) rows are inserted into a single partition of the table and then statistics are gathered. This is done 60 times, simulating loading 60 one day partitions (one at a time) emulating a daily ETL/ELT process over 60 days.


Incremental_Stats.png

Elapsed Times
Partitions
Incremental=FALSE
Incremental=TRUE
1
00:00:20.36
00:00:21.14
10
00:02:27.25
00:00:37.76
20
00:04:46.23
00:00:49.83
30
00:07:05.47
00:01:01.80
40
00:09:11.09
00:01:23.33
50
00:11:33.18
00:01:30.40
60
00:13:18.15
00:01:40.28
Cumulative Elapsed Time
06:42:21.20
01:00:53.80

As you can see from the chart and the table, without Incremental Global Stats the time to gather stats increases pretty much linearly with the number of partitions, but with Incremental Global Stats the elapse time only slightly increases. The big difference is in the cumulative elapsed time: It takes 6 hours 42 minutes without Incremental Global Stats, but only 1 hour with. Quite a significant savings over time!

Revisiting The Math

For this experiment the time to gather stats without Incremental Global Stats is:
(time to scan & gather for 1 partition) + (time to scan and gather for entire table)
When Incremental Global Stats is used the time to gather stats is:
(time to scan & gather for 1 partition) + (time to aggregate all synopses)

The Diff Test

I exported the stats into a stats table and then ran the diff to compare the two runs. This will show us how comparable the two methods of stats gathering are.

SQL> set long 500000 longchunksize 500000
SQL> select report, maxdiffpct from
     table(dbms_stats.diff_table_stats_in_stattab(user,'CATALOG_SALES','STATS_DEFAULT','STATS_INC'));

REPORT
------------------------------------------------------------------------------------
MAXDIFFPCT
----------
###############################################################################

STATISTICS DIFFERENCE REPORT FOR:
.................................

TABLE	      : CATALOG_SALES
OWNER	      : TPCDS
SOURCE A      : User statistics table STATS_DEFAULT
	      : Statid	   :
	      : Owner	   : TPCDS
SOURCE B      : User statistics table STATS_INC
	      : Statid	   :
	      : Owner	   : TPCDS
PCTTHRESHOLD  : 10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


NO DIFFERENCE IN TABLE / (SUB)PARTITION STATISTICS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

COLUMN STATISTICS DIFFERENCE:
.............................

COLUMN_NAME	SRC NDV     DENSITY    HIST NULLS   LEN  MIN   MAX   SAMPSIZ
...............................................................................

CS_BILL_ADDR_SK A   1001152 .000000998 NO   148640  5	 C102  C402  5.9E+07
		B   1001176 .000000998 NO   148613  5	 C102  C402  5.9E+07
CS_BILL_CDEMO_S A   1868160 .000000535 NO   148646  6	 C102  C4025 5.9E+07
		B   1878320 .000000532 NO   148753  6	 C102  C4025 5.9E+07
CS_BILL_CUSTOME A   1942528 .000000514 NO   148104  6	 C102  C403  5.9E+07
		B   1949464 .000000512 NO   148192  6	 C102  C403  5.9E+07
CS_BILL_HDEMO_S A   7200    .000138888 NO   148227  4	 C102  C249  5.9E+07
		B   7200    .000138888 NO   148250  4	 C102  C249  5.9E+07
CS_CALL_CENTER_ A   30	    .033333333 NO   148310  3	 C102  C11F  5.9E+07
		B   30	    .033333333 NO   148272  3	 C102  C11F  5.9E+07
CS_CATALOG_PAGE A   11092   .000090155 NO   148111  5	 C102  C3023 5.9E+07
		B   11092   .000090155 NO   148154  5	 C102  C3023 5.9E+07
CS_EXT_LIST_PRI A   1133824 .000000881 NO   148461  6	 C102  C3036 5.9E+07
		B   1131680 .000000883 NO   148368  6	 C102  C3036 5.9E+07
CS_EXT_WHOLESAL A   394880  .000002532 NO   148842  5	 C102  C302  5.9E+07
		B   394880  .000002532 NO   148772  5	 C102  C302  5.9E+07
CS_ITEM_SK	A   205888  .000004857 NO   0	    5	 C102  C3152 5.9E+07
		B   205408  .000004868 NO   0	    5	 C102  C3152 5.9E+07
CS_LIST_PRICE	A   29896   .000033449 NO   148438  5	 C102  C204  5.9E+07
		B   29896   .000033449 NO   148458  5	 C102  C204  5.9E+07
CS_ORDER_NUMBER A   7151104 .000000139 NO   0	    6	 C102  C4102 5.9E+07
		B   7122072 .000000140 NO   0	    6	 C102  C4102 5.9E+07
CS_PROMO_SK	A   1000    .001       NO   148617  4	 C102  C20B  5.9E+07
		B   1000    .001       NO   148693  4	 C102  C20B  5.9E+07
CS_QUANTITY	A   100     .01        NO   148737  3	 C102  C202  5.9E+07
		B   100     .01        NO   148751  3	 C102  C202  5.9E+07
CS_SHIP_ADDR_SK A   1001088 .000000998 NO   148150  5	 C102  C402  5.9E+07
		B   1001152 .000000998 NO   148235  5	 C102  C402  5.9E+07
CS_SHIP_CDEMO_S A   1870592 .000000534 NO   148918  6	 C102  C4025 5.9E+07
		B   1878272 .000000532 NO   148862  6	 C102  C4025 5.9E+07
CS_SHIP_CUSTOME A   1938816 .000000515 NO   148300  6	 C102  C403  5.9E+07
		B   1948928 .000000513 NO   148309  6	 C102  C403  5.9E+07
CS_SHIP_DATE_SK A   1884    .000530785 NO   148674  6	 C4032 C4032 5.9E+07
		B   1884    .000530785 NO   148608  6	 C4032 C4032 5.9E+07
CS_SHIP_HDEMO_S A   7200    .000138888 NO   148172  4	 C102  C249  5.9E+07
		B   7200    .000138888 NO   148161  4	 C102  C249  5.9E+07
CS_SHIP_MODE_SK A   20	    .05        NO   148437  3	 C102  C115  5.9E+07
		B   20	    .05        NO   148486  3	 C102  C115  5.9E+07
CS_SOLD_DATE_SK A   1595    .000626959 NO   0	    6	 C4032 C4032 5.9E+07
		B   1587    .000630119 NO   0	    6	 C4032 C4032 5.9E+07
CS_WAREHOUSE_SK A   15	    .066666666 NO   148651  3	 C102  C110  5.9E+07
		B   15	    .066666666 NO   148620  3	 C102  C110  5.9E+07
CS_WHOLESALE_CO A   9901    .000100999 NO   149054  4	 C102  C202  5.9E+07
		B   9901    .000100999 NO   149099  4	 C102  C202  5.9E+07

The stats diff shows that for many columns the NDV is identical and the others are statistically equivalent (close enough to be the same). I will certainly be adding this feature to my “conviction must use list” for Oracle 11g.

Further Reading

If you are interested in the bits and bytes of how the synopsis-based method works, I would suggest you read the whitepaper, Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g that was presented on this topic at SIGMOD 2008.

Using Bitmap Indexes Effectively

Recently I was reading this thread, “Trying to make use of bitmap indexes” on the Oracle Forum. Before I had finished a working example, Jonathan Lewis had posted his response which was on par with my thoughts. Since this is a topic I see frequently, I thought I would finish my experiment and publish it here.

What We Are Given

The author of the original post had stated that the table in question contains about 16 million rows and states: “The table contains three IDEAL columns for bitmap indexes the first of which
may have only two, the second three and the third four distinct values. I was planning to change the index type on these columns to BITMAP [from B-tree]
.” To keep the focus of this post narrow, I’m only going to discuss whether or not one should consider bitmap indexes for queries, and not discuss potential update related issues.

The Data

For this experiment, I’m going to create a table that has three columns with the given NDV from above and add in a few extra filler columns to pad it out a bit. Since I do not know the exact table structure, I’ll just go with a simple example. In reality, the posters table may be wider, but for this example, it is what it is.

create table bm_test
nologging compress
as
select
  round(dbms_random.value(1, 2)) a  -- NDV 2
, round(dbms_random.value(1, 3)) b  -- NDV 3
, round(dbms_random.value(1, 4)) c  -- NDV 4
, to_char(800000+100000*dbms_random.normal,'fm000000000000') c3
, to_char(800000+100000*dbms_random.normal,'fm000000000000') c4
, to_char(15000+2000*dbms_random.normal,'fm000000') c5
, to_char(80000+10000*dbms_random.normal,'fm000000000000') c6
from dual
connect by level <= 16000000
/

desc bm_test
 Name		   Null?    Type
 ----------------- -------- ------------
 A			    NUMBER
 B			    NUMBER
 C			    NUMBER
 C3			    VARCHAR2(13)
 C4			    VARCHAR2(13)
 C5			    VARCHAR2(7)
 C6			    VARCHAR2(13)

exec dbms_stats.gather_table_stats(user,'BM_TEST');

create bitmap index bm1 on bm_test(a);
create bitmap index bm2 on bm_test(b);
create bitmap index bm3 on bm_test(c);

select a, b, c, count(*)
from bm_test
group by a,b,c
order by a,b,c;

         A          B          C   COUNT(*)
---------- ---------- ---------- ----------
         1          1          1     333292
         1          1          2     666130
         1          1          3     666092
         1          1          4     333585
         1          2          1     668594
         1          2          2    1332121
         1          2          3    1332610
         1          2          4     668608
         1          3          1     333935
         1          3          2     666055
         1          3          3     666619
         1          3          4     333106
         2          1          1     333352
         2          1          2     665038
         2          1          3     665000
         2          1          4     333995
         2          2          1     669120
         2          2          2    1332744
         2          2          3    1332766
         2          2          4     668411
         2          3          1     333891
         2          3          2     665924
         2          3          3     664799
         2          3          4     334213

24 rows selected.

select segment_name,
       segment_type,
       sum(blocks) blocks,
       sum(bytes)/1024/1024 mb
from user_segments
where segment_name like 'BM%'
group by segment_name, segment_type;

SEGMENT_NAME SEGMENT_TYPE     BLOCKS         MB
------------ ------------ ---------- ----------
BM_TEST      TABLE            102592      801.5
BM1          INDEX               768          6
BM2          INDEX              1152          9
BM3          INDEX              1408         11

select object_name, object_id
from user_objects
where object_name like 'BM%'

OBJECT_NAME   OBJECT_ID
------------ ----------
BM_TEST           54744
BM1               54745
BM2               54746
BM3               54747

The Queries And Execution Plans

The original post did not contain any queries or predicates, so for the purpose of this example I’m going to assume that there are exactly three predicates, one on each of column A, B and C, and that each predicate is a single equality (e.g. A=1 and B=1 and C=1). Looking at the data distribution from the query above, we observe there are approximately three different grouping counts: the lower around 333,000 the middle around 666,000 and the upper around 1,300,000. I will choose tuples from each of these groupings for the three test cases.

Query A

select *
from bm_test
where a=1 and b=1 and c=1;

333292 rows selected.
Plan hash value: 3643416817

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |       |       | 23314 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID | BM_TEST |   326K|    17M| 23314   (1)| 00:04:40 |
|   2 |   BITMAP CONVERSION TO ROWIDS|         |       |       |            |          |
|   3 |    BITMAP AND                |         |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE| BM3     |       |       |            |          |
|*  5 |     BITMAP INDEX SINGLE VALUE| BM2     |       |       |            |          |
|*  6 |     BITMAP INDEX SINGLE VALUE| BM1     |       |       |            |          |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("C"=1)
   5 - access("B"=1)
   6 - access("A"=1)

Query B

select *
from bm_test
where a=1 and b=1 and c=2;

666130 rows selected.
Plan hash value: 3202922749

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |       |       | 27105 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID | BM_TEST |   653K|    34M| 27105   (1)| 00:05:26 |
|   2 |   BITMAP CONVERSION TO ROWIDS|         |       |       |            |          |
|   3 |    BITMAP AND                |         |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE| BM2     |       |       |            |          |
|*  5 |     BITMAP INDEX SINGLE VALUE| BM1     |       |       |            |          |
|*  6 |     BITMAP INDEX SINGLE VALUE| BM3     |       |       |            |          |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("B"=1)
   5 - access("A"=1)
   6 - access("C"=2)

Query C

select *
from bm_test
where a=1 and b=2 and c=2;

1332121 rows selected.
Plan hash value: 1873942893

-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |       |       | 28243 (100)|          |
|*  1 |  TABLE ACCESS FULL| BM_TEST |  1377K|    72M| 28243   (2)| 00:05:39 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(("C"=2 AND "B"=2 AND "A"=1))

As you can see from the execution plans, Query A and B use the bitmap indexes and Query C uses a Full Table Scan. Of the 16,000,000 rows, Query A returns 333,292 (2.08%), Query B returns 666,130 (4.16%) and Query C returns 1,332,121 rows (8.33%). I think it is important to note that the change in the execution plan from index access to table scan is due to the costing, not directly due to the percentage of data returned.

Execution Times

I’m going to gather two sets of execution times. The first will be with a cold buffer cache, and the second with a warm buffer cache. All elapsed times are in seconds.

Query Execution Plan Cold Cache Warm Cache
A Bitmap Index 38 3
B Bitmap Index 40 4
C FTS 16 16

As you can see from the execution times, there is a significant difference (approx. 11x) between the cold and warm cache executions of each Query A and Query B. The other observation is that Query C (FTS) is faster than Query A (Index Access) on a cold cache. We surely need to account for this. One observation I made (from iostat) is that the I/O throughput rate for Query A and Query B was around 23MB/s while the I/O rate for Query C was around the 55MB/s range during the cold cache execution. None of the queries used the Parallel Query Option.

Lets take a look at the tkprof output from both the cold and warm cache executions of Query A and see if we can find where the time is being spent. The traces were collected using event 10046, level 8.

Query A TKPROF – Warm Cache

select /* warm cache */ *
from bm_test
where a=1 and b=1 and c=1


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     3334      2.20       2.17          0     102184          0      333292
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     3336      2.20       2.18          0     102184          0      333292

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 31

Rows     Row Source Operation
-------  ---------------------------------------------------
 333292  TABLE ACCESS BY INDEX ROWID BM_TEST (cr=102184 pr=0 pw=0 time=19332 us cost=23314 size=17945290 card=326278)
 333292   BITMAP CONVERSION TO ROWIDS (cr=1162 pr=0 pw=0 time=2329 us)
     92    BITMAP AND  (cr=1162 pr=0 pw=0 time=1691 us)
    642     BITMAP INDEX SINGLE VALUE BM3 (cr=367 pr=0 pw=0 time=104 us)(object id 54747)
    697     BITMAP INDEX SINGLE VALUE BM2 (cr=396 pr=0 pw=0 time=92 us)(object id 54746)
    727     BITMAP INDEX SINGLE VALUE BM1 (cr=399 pr=0 pw=0 time=117 us)(object id 54745)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    3337        0.00          0.00
  SQL*Net message from client                  3337        0.00          1.04

When the cache is warm, there are no physical reads that take place. This would explain the fast execution of the query.

Note: For Bitmap execution plans, the number that appears in the rows column is actually bitmap fragments (compressed rowids), not actual rows. This is why the number looks suspiciously small.

Query A TKPROF – Cold Cache

select /* cold cache */ *
from bm_test
where a=1 and b=1 and c=1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     3334     11.44      36.22      99722     102184          0      333292
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     3336     11.45      36.22      99722     102184          0      333292

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 31

Rows     Row Source Operation
-------  ---------------------------------------------------
 333292  TABLE ACCESS BY INDEX ROWID BM_TEST (cr=102184 pr=99722 pw=99722 time=294694 us cost=23314 size=17945290 card=326278)
 333292   BITMAP CONVERSION TO ROWIDS (cr=1162 pr=1041 pw=1041 time=2490 us)
     92    BITMAP AND  (cr=1162 pr=1041 pw=1041 time=5104 us)
    642     BITMAP INDEX SINGLE VALUE BM3 (cr=367 pr=324 pw=324 time=1840 us)(object id 54747)
    697     BITMAP INDEX SINGLE VALUE BM2 (cr=396 pr=351 pw=351 time=1817 us)(object id 54746)
    727     BITMAP INDEX SINGLE VALUE BM1 (cr=399 pr=366 pw=366 time=1534 us)(object id 54745)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    3336        0.00          0.00
  SQL*Net message from client                  3336        0.00          1.12
  db file sequential read                     99722        0.04         30.60

As you can see the majority of the time was spent on db file sequential read doing the 99,722 physical reads. This explains the difference in elapsed time between the cold and warm cache executions of Query A: it comes down to physical I/O. But why does Query C run in half the time that Query A runs in when the cache is cold, given that Query C is doing a FTS and Query A is not? Shouldn’t the FTS plan be slower than the index plan?

Looking at the raw trace file for Query A, we observe the following:

WAIT #2: nam='db file sequential read' ela= 241 file#=1 block#=1770152 blocks=1 obj#=54744 tim=1212013191665924
WAIT #2: nam='db file sequential read' ela= 232 file#=1 block#=1770153 blocks=1 obj#=54744 tim=1212013191666240
WAIT #2: nam='db file sequential read' ela= 351 file#=1 block#=1770156 blocks=1 obj#=54744 tim=1212013191666650
WAIT #2: nam='db file sequential read' ela= 240 file#=1 block#=1770157 blocks=1 obj#=54744 tim=1212013191666948
WAIT #2: nam='db file sequential read' ela= 298 file#=1 block#=1770158 blocks=1 obj#=54744 tim=1212013191667306

As you can see, the table is being read sequentially 1 block at a time. Let’s examine the TKPROF from Query C.

Query C TKPROF

select *
from bm_test
where a=1 and b=2 and c=2

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    13323      5.99      11.17     102592     115831          0     1332121
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    13325      5.99      11.17     102592     115831          0     1332121

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 31

Rows     Row Source Operation
-------  ---------------------------------------------------
1332121  TABLE ACCESS FULL BM_TEST
(cr=115831 pr=102592 pw=102592 time=102744 us cost=28243 size=75768825 card=1377615)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   13325        0.00          0.01
  SQL*Net message from client                 13325        0.00          4.23
  db file sequential read                         2        0.02          0.03
  direct path read                              952        0.08          5.20

The majority of the time is spent on direct path read.

Let’s dig deeper and look at the raw trace file from Query C.

WAIT #2: nam='direct path read' ela= 6029 file number=1 first dba=1785609 block cnt=128 obj#=54744 tim=1212013229612857
WAIT #2: nam='direct path read' ela= 8638 file number=1 first dba=1787017 block cnt=128 obj#=54744 tim=1212013229628256
WAIT #2: nam='direct path read' ela= 7019 file number=1 first dba=1789193 block cnt=128 obj#=54744 tim=1212013229642410
WAIT #2: nam='direct path read' ela= 9276 file number=1 first dba=1791497 block cnt=128 obj#=54744 tim=1212013229658400
WAIT #2: nam='direct path read' ela= 6173 file number=1 first dba=1792777 block cnt=128 obj#=54744 tim=1212013229671314

As you can see with Query C, the read size is 128 blocks or 1MB (128 blocks * 8k block), the largest I/O that Oracle will issue. This explains the difference in the observed I/O throughput (23MB/s vs. 55MB/s): the bitmap index plan reads the table 1 block at a time, and the FTS reads (most of) it 128 blocks at a time. It makes good sense that if the read throughput rate is ~2x (23MB/s vs. 55MB/s) then the execution time would be ~0.5 as long (38 seconds vs. 16 seconds). The larger I/O size will have a higher throughput rate compared to a smaller I/O size. The exact breakdown of the multi-block reads are:

BLOCK_COUNT      COUNT TOTAL_BLOCKS
----------- ---------- ------------
          7          2           14
          8        106          848
          9         34          306
         16         10          160
         33          8          264
        119         42         4998
        128        750        96000
            ---------- ------------
sum                952       102590

Making Sense Of All The Observations

If we look at the tkprof output again from Query A, we see there are 99,722 waits on db file sequential read. Of those 99,722 waits, 98,681 are on the table (grep is our friend here using the raw trace file and the event and object number), the remaining are for the indexes. This tells us that 98,681 out of 102,592 total blocks of the table were retrieved, just 1 block at a time. Basically we have done a very inefficient full table scan. This explains our two observations: 1) why the FTS is faster than the index access plan with a cold cache and 2) why the FTS has a higher read throughput than the index access plan. It all comes down to efficient physical I/O.

The Big Picture

Just because a column has a low NDV does not necessarily mean it is an ideal candidate for a bitmap index. Just like B-tree indexes, bitmap indexes are best leveraged when the combination of them makes it very selective (returns only a small number of rows). The classic example of using a bitmap index on a gender column (male/female) is a horrible one in my opinion. If there are only two values, and there is an even distribution of data, 50% selectivity is too large and thus not a good candidate for a bitmap index. Would you use any index to access 50% of a table?

Bitmap indexes can be very useful in making queries run fast, but if the BITMAP CONVERSION TO ROWIDS returns a large list of rowids, you may find that a FTS (or partition scan) may yield better performance, but may use more I/O resources. It comes down to a trade off: If there is a high buffer cache hit rate for the objects in the bitmap plans, it will run reasonably fast and requite less physical I/O. If the objects are unlikely to be in the buffer cache, a FTS will yield better performance as long as it is not bottlenecked on I/O bandwidth.

Null-Aware Anti-Join

Recently someone showed me a query execution plan with an operation of HASH JOIN ANTI NA. At first, it was thought maybe it was a bug and the operation had a type-o in it, but after further research it was discovered it was a valid operation and a new cost-based query transformation for subquery unnesting in Oracle Database 11g. The NA stands for Null-Aware. There is also a second type of Null-Aware Anti-Join, which is the Single Null-Aware Anti-Join which is displayed in the execution plan as ANTI SNA. The null-aware anti-join may be computed using each of the three types of of join operations: the sort-merge join, hash join and nested loops join.

What is the advantage of a Null-Aware Anti-Join? If we look at the patent application for Null-Aware Anti-Joins we will see that paragraph 0006 gives a brief description:

[0006] A common type of query that is optimized is a query that contains a subquery whose join condition involves the NOT IN/ALL operator (NOT IN is equivalent to != ALL). In data-warehouses with reporting applications, such queries and subqueries are usually evaluated on very large sets of data. Thus, it is critical to make such queries scale in any SQL execution engine. When such queries are not optimized using anti-join, the subquery is executing an operation that is effectively a Cartesian product, which is quite inefficient.

Before we look at the performance side of things, lets just take a look at some simple examples with our favorite EMP table.

SQL> select * from emp;

     EMPNO ENAME      JOB	       MGR HIREDATE	     SAL       COMM	DEPTNO
---------- ---------- --------- ---------- ---------- ---------- ---------- ----------
      7369 SMITH      CLERK	      7902 1980-12-17	     800		    20
      7499 ALLEN      SALESMAN	      7698 1981-02-20	    1600	300	    30
      7521 WARD       SALESMAN	      7698 1981-02-22	    1250	500	    30
      7566 JONES      MANAGER	      7839 1981-04-02	    2975		    20
      7654 MARTIN     SALESMAN	      7698 1981-09-28	    1250       1400	    30
      7698 BLAKE      MANAGER	      7839 1981-05-01	    2850		    30
      7782 CLARK      MANAGER	      7839 1981-06-09	    2450		    10
      7788 SCOTT      ANALYST	      7566 1987-04-19	    3000		    20
      7839 KING       PRESIDENT 	   1981-11-17	    5000		    10
      7844 TURNER     SALESMAN	      7698 1981-09-08	    1500	  0	    30
      7876 ADAMS      CLERK	      7788 1987-05-23	    1100		    20
      7900 JAMES      CLERK	      7698 1981-12-03	     950		    30
      7902 FORD       ANALYST	      7566 1981-12-03	    3000		    20
      7934 MILLER     CLERK	      7782 1982-01-23	    1300		    10

14 rows selected.

As you can see, there is one row where MGR is null.

In the below examples, I’m going to refer to the outer query as the left side, and the subquery as the right side. Each test case has the query, the execution plan and a snippet of the 10053 trace.

11.1.0.6
Test Case 1: Either Side Can Be Null

select count(*)
from   emp
where  mgr not in (select mgr from emp);
Execution Plan
----------------------------------------------------------
Plan hash value: 54517352

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     8 |     5  (20)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     8 |            |          |
|*  2 |   HASH JOIN ANTI NA |      |    13 |   104 |     5  (20)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMP  |    14 |    56 |     2   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| EMP  |    14 |    56 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("MGR"="MGR")

*****************************
Cost-Based Subquery Unnesting
*****************************
SU: Unnesting query blocks in query block SEL$1 (#1) that are valid to unnest.
Subquery Unnesting on query block SEL$1 (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block SEL$1 (#1).
SU:   Checking validity of unnesting subquery SEL$2 (#2)
SU:   Passed validity checks.
SU:   Transform ALL subquery to a null-aware antijoin.
Registered qb: SEL$5DA710D3 0x77a2e6bc (SUBQUERY UNNEST SEL$1; SEL$2)

Test Case 2: Right Side Is Not Null

select count(*)
from   emp
where  mgr not in (select mgr from emp where mgr is not null);
Execution Plan
----------------------------------------------------------
Plan hash value: 2818854569

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     8 |     5  (20)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     8 |            |          |
|*  2 |   HASH JOIN ANTI SNA|      |    13 |   104 |     5  (20)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMP  |    14 |    56 |     2   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| EMP  |    13 |    52 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("MGR"="MGR")
   4 - filter("MGR" IS NOT NULL)

*****************************
Cost-Based Subquery Unnesting
*****************************
SU: Unnesting query blocks in query block SEL$1 (#1) that are valid to unnest.
Subquery Unnesting on query block SEL$1 (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block SEL$1 (#1).
SU:   Checking validity of unnesting subquery SEL$2 (#2)
SU:   Passed validity checks.
SU: Transform ALL subquery to a single null-aware antijoin.
Registered qb: SEL$5DA710D3 0x67e897e8 (SUBQUERY UNNEST SEL$1; SEL$2)

Test Case 3: Left Side Is Not Null

select count(*)
from   emp
where  mgr not in (select mgr from emp) and
       mgr is not null;
Execution Plan
----------------------------------------------------------
Plan hash value: 54517352

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     8 |     5  (20)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     8 |            |          |
|*  2 |   HASH JOIN ANTI NA |      |    12 |    96 |     5  (20)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| EMP  |    13 |    52 |     2   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| EMP  |    14 |    56 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("MGR"="MGR")
   3 - filter("MGR" IS NOT NULL)

*****************************
Cost-Based Subquery Unnesting
*****************************
SU: Unnesting query blocks in query block SEL$1 (#1) that are valid to unnest.
Subquery Unnesting on query block SEL$1 (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block SEL$1 (#1).
SU:   Checking validity of unnesting subquery SEL$2 (#2)
SU:   Passed validity checks.
SU:   Transform ALL subquery to a null-aware antijoin.
SU:   Checking validity of unnesting subquery SEL$2 (#3)
SU:   Validity checks failed.
Registered qb: SEL$5DA710D3 0x7a357c98 (SUBQUERY UNNEST SEL$1; SEL$2)

Test Case 4: Neither Side Is Null

select count(*)
from   emp
where  mgr not in (select mgr from emp where mgr is not null) and
       mgr is not null;
Execution Plan
----------------------------------------------------------
Plan hash value: 868928733

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     8 |     5  (20)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     8 |            |          |
|*  2 |   HASH JOIN ANTI    |      |    12 |    96 |     5  (20)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| EMP  |    13 |    52 |     2   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| EMP  |    13 |    52 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("MGR"="MGR")
   3 - filter("MGR" IS NOT NULL)
   4 - filter("MGR" IS NOT NULL)
*****************************
Cost-Based Subquery Unnesting
*****************************
SU: Unnesting query blocks in query block SEL$1 (#1) that are valid to unnest.
Subquery Unnesting on query block SEL$1 (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block SEL$1 (#1).
SU:   Checking validity of unnesting subquery SEL$2 (#2)
SU:   Passed validity checks.
SU:   Transform ALL subquery to a regular antijoin.
Registered qb: SEL$5DA710D3 0x73a4d370 (SUBQUERY UNNEST SEL$1; SEL$2)

As you can see in Test Case 1 and Test Case 3, the optimizer chooses a Null-Aware Anti-Join. In Test Case 2, a Single Null-Aware Anti-Join is chosen, and in Test Case 4 a Regular Anti-Join is chosen.

Let’s compare the plans to 10.2.0.4. I used optimizer_features_enable='10.2.0.4' on my 11.1.0.6 database as well as tested it on 10.2.0.4; the plans are identical in both cases.

10.2.0.4
Test Case 1: Either Side Can Be Null

select count(*)
from   emp
where  mgr not in (select mgr from emp);
Execution Plan
----------------------------------------------------------
Plan hash value: 1842922539

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     4 |    14   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     4 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| EMP  |    14 |    56 |     2   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| EMP  |     2 |     8 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter( NOT EXISTS (SELECT 0 FROM "EMP" "EMP" WHERE
              LNNVL("MGR"<>:B1)))
   4 - filter(LNNVL("MGR"<>:B1))

Test Case 2: Right Side Is Not Null

select count(*)
from   emp
where  mgr not in (select mgr from emp where mgr is not null);
Execution Plan
----------------------------------------------------------
Plan hash value: 1842922539

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     4 |    14   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     4 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| EMP  |    14 |    56 |     2   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| EMP  |     2 |     8 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter( NOT EXISTS (SELECT 0 FROM "EMP" "EMP" WHERE "MGR" IS NOT
              NULL AND LNNVL("MGR"<>:B1)))
   4 - filter("MGR" IS NOT NULL AND LNNVL("MGR"<>:B1))

Test Case 3: Left Side Is Not Null

select count(*)
from   emp
where  mgr not in (select mgr from emp) and
       mgr is not null;
Execution Plan
----------------------------------------------------------
Plan hash value: 1842922539

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     4 |    14   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     4 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL| EMP  |    13 |    52 |     2   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| EMP  |     2 |     8 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter( NOT EXISTS (SELECT 0 FROM "EMP" "EMP" WHERE
              LNNVL("MGR"<>:B1)))
   3 - filter("MGR" IS NOT NULL)
   4 - filter(LNNVL("MGR"<>:B1))

Test Case 4: Neither Side Is Null

select count(*)
from   emp
where  mgr not in (select mgr from emp where mgr is not null) and
       mgr is not null;
Execution Plan
----------------------------------------------------------
Plan hash value: 868928733

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     8 |     5  (20)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     8 |            |          |
|*  2 |   HASH JOIN ANTI    |      |     1 |     8 |     5  (20)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| EMP  |    13 |    52 |     2   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| EMP  |    13 |    52 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("MGR"="MGR")
   3 - filter("MGR" IS NOT NULL)
   4 - filter("MGR" IS NOT NULL)

In 10.2.0.4 each of Test Case 1-3 have the same execution plan, but a different one than in 11.1.0.6 because of the new query transformation. Test Case 4 has the same plan in both 10.2.0.4 and 11.1.0.6, which is expected, because neither side can be null and the new query transformation does not kick in. Note the difference on line 2: The 11g plans use the null-aware anti-join, and the 10g plans use a filter.

Performance Test

For a performance test case, I’m going to create two tables of 100,000 rows using the below script and run the Test Cases against them setting OFE to 11.1.0.6 and 10.2.0.4:

drop table t1;

create table t1
as
select case when mod((rownum + 90000),1000) = 0
            then null
            else rownum
       end as a
from dual
connect by level <= 100000;

exec dbms_stats.gather_table_stats(user,'t1');

drop table t2;

create table t2
as
select case when mod((rownum + 90000),1000) = 0
            then null
            else rownum + 90000
       end as a
from dual
connect by level <= 100000;

exec dbms_stats.gather_table_stats(user,'t2');

Performance Test Results

Test Case 10.2.0.4 11.1.0.6
1 00:00:08.24 00:00:00.05
2 00:12:31.24 00:00:00.10
3 00:00:09.08 00:00:00.05
4 00:00:00.10 00:00:00.10

Test Case 1 and 3 have around 82x better time with the 11.1.0.6 plan compared to 10.2.0.4, but the significant difference is with Test Case 2. It’s time was reduced by 7500x or so; from over 12 minutes to less than 1 second. If we examine the 10.2.0.4 plans, we see the optimizer applies a filter push-down transformation using NOT EXISTS and LNNVL.

Let’s examine the statistics of each execution from autotrace.

10.2.0.4 Plan

Execution Plan
----------------------------------------------------------
Plan hash value: 59119136

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     4 |  4014K  (5)| 13:22:56 |
|   1 |  SORT AGGREGATE     |      |     1 |     4 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1   |   100K|   390K|    45   (5)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| T2   |     1 |     4 |    45   (5)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter( NOT EXISTS (SELECT 0 FROM "T2" "T2" WHERE "T2"."A" IS
              NOT NULL AND LNNVL("T2"."A"<>:B1)))
   4 - filter("T2"."A" IS NOT NULL AND LNNVL("T2"."A"<>:B1))


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
   14137436  consistent gets
          0  physical reads
          0  redo size
        420  bytes sent via SQL*Net to client
        415  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

11.1.0.6 Plan

Execution Plan
----------------------------------------------------------
Plan hash value: 1028670007

------------------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     8 |       |   245   (3)| 00:00:03 |
|   1 |  SORT AGGREGATE     |      |     1 |     8 |       |            |          |
|*  2 |   HASH JOIN ANTI SNA|      |  9998 | 79984 |  1568K|   245   (3)| 00:00:03 |
|   3 |    TABLE ACCESS FULL| T1   |   100K|   390K|       |    44   (3)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| T2   | 99900 |   390K|       |    45   (5)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."A"="T2"."A")
   4 - filter("T2"."A" IS NOT NULL)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        312  consistent gets
          0  physical reads
          0  redo size
        420  bytes sent via SQL*Net to client
        415  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

The big difference here is that the HASH JOIN ANTI SNA plan has significantly less consistent gets: 312 vs. 14,137,436 – over a 45,000x difference!!! Hence the 12 minutes to less than 1 second execution time. I think it is quite safe to say that the HASH JOIN ANTI SNA is much better than the FILTER plan.

As demonstrated, the Null-Aware Anti-Join query transformation can have a significant performance, even on two tables consisting of a modest 100,000 rows of data.

Top Ways How Not To Scale Your Data Warehouse

Working in the Real-World Performance Group at Oracle has allowed me to see quite a few customers’ data warehouses. Unfortunately some customers find their data warehouse suffering from performance problems, not because there is a platform issue, but often because the features are not used or are not used correctly. I thought I’d put together a list of the most common problems but present them in a facetious manner. The following is meant to be sarcastic and read with a bit of humor. Consider it the “Comedy of Errors” data warehouse edition.

Add An Index To Fix Each Slow Query

If you think the solution for slow queries in a data warehouse is to add indexes, you are probably mistaken. Not only is index access likely the most inefficient way to access lots of rows, the more indexes that are on a table, the longer it takes to load data into that table, even if you build them after the data is loaded; not to mention the extra space that is required for those indexes.

Do Not Use Parallel Operations

There is a reason that row-by-row processing is synonymous with slow-by-slow. It does not scale. If you want to make sure you will have no chance to scale your ETL/ELT, use PL/SQL cursor for loops and process the data serially.

Parallel query is one of the best ways to linearly reduce query response times so not using it would surely help you not scale your data warehouse. Why would one want to solve a few queries in a short amount of time by leveraging a large amount of hardware resources with many parallel processes when you can run many serial queries each using a single process? After all, you would not want to make your CPUs and disks too busy, as they might get tired.

Do Not Use Partitioning

Not effectively using partitioning is probably the best way not to scale your data warehouse. Without leveraging partition elimination in queries, is is pretty much a guarantee that the more data you add to your tables, the longer your queries will run. Give yourself enough time and data, and you will surely have the slowest data warehouse on the planet, if not the most inefficient.

Do Not Use Compression

Using compression would not only reduce the disk space required to store data, it also would reduce the amount of I/O bandwidth to bring data back to the host. Given that most database servers have infinite storage capacity as well as infinite I/O bandwidth, compression would not yield much, if any benefit. Why do less work when you can do more?

Besides, using compression makes the load take longer, right? Given that the data is loaded once and queried thousands (or more) times, it would make sense to optimize for loading and not queries, no?

Do Not Use Analytic SQL

Why would anyone want to use analytic SQL when they can write simpler SQL that performs slower without that functionality. After all, isn’t it better to access the data multiple times versus a single pass? The more times the data is accessed, the more likely it is to be in cache, correct?

Do Not Use Materialized Views Or Aggregates

Why use a “work smarter, not harder” mentality when you can just brute force it. After all, we are manly men and will not be shown up by brains and elegance. We’d rather add up every row from the point of sale table for year end report versus adding up the monthly aggregates. Disk space is cheap and performance is expensive, so perhaps if enough space is saved, it will add up, right?

Do Not Store Data In Its Most Usable Format

Storing data in a uniform case (all upper or all lower) would make the data too clean and usable. It’s much more efficient to change the data with an upper() or lower() function each and every time the data is queried versus changing it once when it is loaded. And while we are at it, let’s store dates as strings and convert them every time as well. Since there is idle CPU from running serial queries, might as well put it to good use changing the case of text and turning strings into dates, no? Why store that GL segment number that everyone queries on in its own column when the users can just use a substring() function to find it. After all, storing redundant data that would make queries perform better is not the primary objective, it’s saving those bytes per row that is critical.

Do Not Leverage Good Design

A data warehouse model is just an OLTP model with much more historical data and a few extra tables for reports, right? One of the best ways to tank your data warehouse’s performance is to design and manage it like it was a large OLTP database. Fundamentally OLTP and data warehousing are completely different and have very little in common, but they can’t be that different, can they?

Even though most BI tools work best when there is well designed model, it is possible to just put a layer of performance killing views in the database to make it look like a well designed dimensional model. This way the ETL can stay simple, the users do not have to write complicated SQL and the BI tools will work (slowly). Why take the time to design a good data model when you can emulate one at one tenth (or less) the scalability.

Consume Excessive Time And Resources Gathering Statistics

Another great way to consume time and compute resources is gathering statistics on data that has not changed or has not significantly changed. If you are regathering statistics on data that has not changed, you should probably consider that Albert Einstein once said “The definition of insanity is doing the same thing over and over again and expecting different results”.

Put The Data Warehouse On A Shared Disk Array

Given that a data warehouse generally requires significant I/O bandwidth and often times use significant disk and compute resources, it would probably be a good idea to share it with other systems. This way the data warehouse (as well as the other applications) will not only have unpredictable performance, but the warehouse will have a scape goat for one to blame when there are issues. Last time I checked, none of the data warehouse appliances or MPP data warehouse databases shared storage with anyone, so maybe they are missing something?