Kevin Closson's Silly Little Benchmark Is Silly Fast On Nehalem

Recently Kevin Closson wrote about the absurdly simple NUMA requirements for TPC-C on the new Intel Nehalem platform and I also mentioned my excitement that databases run 2X faster on Nehalem 5500 series compared to the Intel 5400 series processors. I do have to give credit to Kevin for being excited about Nehalem (the processors, not the river, though that is a very nice fish he caught) way back in March of 2007 when he wrote:

[Nehalem] are quad core processors that are going to pack a very significant punch—much more so than the AMD Barcelona processor expected later this year [in 2007]

I also came across Kevin’s Silly Little Bechmark (SLB) and wanted to give it a run on my new Nehalem Mac Pro and see how it compares to the dual-core Opteron 800 series processor in the DL585 Kevin posted results for on his blog. Unfortunately I only have the entry level Mac Pro with the single 2.66GHz Quad-Core Intel Xeon Nehalem processor and 3 GB of memory. Although, if you have $12,000 burning a hole in your pocket you could snag a 2 x 2.93GHz Quad-Core version and stuff it with 32 GB of memory. I’m quite certain you’ll have to add a 5 point racing harness to your desk chair to operate it though. Don’t take my word for it though – see the Geekbench results. My Mac Pro Geekbench result blows away the PC system it is replacing. Anyway back to the SLB tests…

SLB Test Setup (Memhammer)

Here is the script (./slb) I ran to test out the Mac Pro:

#!/bin/bash

uname -a
date

echo "One memhammer crushing 1GB physical memory:"
./create_sem
./memhammer 262144 6000

echo "Reduce the memory to facilitate 1-4 scale-up test"
echo "1 thread .25GB:"
./create_sem
./memhammer 65536 6000

echo "2 threads .25GB each:"
./create_sem
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./trigger
wait

echo "4 threads .25GB each:"
./create_sem
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./trigger
wait

SLB Results

Let’s fire off memhammer and make that CPU sweat!

# ./slb
Darwin greg-rahns-mac-pro.local 9.6.3 Darwin Kernel Version 9.6.3: Tue Jan 20 18:26:40 PST 2009; root:xnu-1228.10.33~1/RELEASE_I386 i386
Fri Apr 10 11:48:06 PDT 2009
One memhammer crushing 1GB physical memory:
Total ops 1572864000  Avg nsec/op    30.6  gettimeofday usec 48204156 TPUT ops/sec 32629219.8
Reduce the memory to facilitate 1-4 scale-up test
1 thread .25GB:
Total ops 393216000  Avg nsec/op    30.7  gettimeofday usec 12059811 TPUT ops/sec 32605486.1
2 threads .25GB each:
Total ops 393216000  Avg nsec/op    32.0  gettimeofday usec 12593203 TPUT ops/sec 31224462.9
Total ops 393216000  Avg nsec/op    32.0  gettimeofday usec 12593862 TPUT ops/sec 31222829.0
4 threads .25GB each:
Total ops 393216000  Avg nsec/op    36.6  gettimeofday usec 14391042 TPUT ops/sec 27323664.3
Total ops 393216000  Avg nsec/op    36.6  gettimeofday usec 14394529 TPUT ops/sec 27317045.2
Total ops 393216000  Avg nsec/op    36.7  gettimeofday usec 14423406 TPUT ops/sec 27262354.0
Total ops 393216000  Avg nsec/op    36.7  gettimeofday usec 14449585 TPUT ops/sec 27212961.5

Pretty good stuff! Right around 33 nanoseconds per operation. Blazing fast!

2.66GHz Quad-Core Intel W3520 Nehalem Vs. AMD 2.20GHz Dual-Core Opteron 800 Series

Below are my Mac Pro results compared to the DL585 AMD results Kevin posed.

Command line: ./memhammer 262144 6000

# Mac Pro @ 2.66GHz Quad-Core Intel Xeon Nehalem (model W3520)
Total ops 1572864000 Avg nsec/op 30.7 gettimeofday usec  48292072 TPUT ops/sec 32569818.1

# HP DL585 @ 2.200GHz Dual-Core Opteron 800 series
Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2

The Mac Pro’s Nehalam processor is just over 2X as fast to perform the operation (30.7 ns vs. 68.8 ns) and thus does just over 2X the operations per second (32569818.1 ops/s vs. 14525744.2 ops/s). This Nehalem Mac Pro is simply amazing and definitely a beast on the inside. Can you imagine the compute power Virginia Tech would have if they upgraded their 324 dual-processor quad-core 2.8GHz Penryn Mac Pros to the 2.93GHz Nehalem Mac Pros?

Feel free to run SLB/Memhammer on your system and put the results in a comment.

Intel Nehalem-EP Xeon 5500 Series Processors Makes Databases Go 2X Faster

As a database performance engineer there are certain things that get me really excited. One of them is hardware. Not just any hardware, but the latest, greatest, bleeding edge stuff. It is especially exciting when the latest generation of CPUs are twice as fast as the previous generation, and those being no slouch. This is how Intel’s new Nehalem-EP Xeon 5500 series processors are shaping up.

The big launch was on March 30th so in the past few days all the benchmark reports and blog posts have been rolling in. Here are a few that I think are worth highlighting:

The SQL Server Performance Blog reports:

Pat Gelsinger did a side-by-side performance demo which launched an SSRS report, running reporting queries against a 1.5 TB SSAS OLAP cube, built using a Microsoft adCenter data set. The demo showed how Nehalem-EP is 2X faster than a Xeon 5400 on the same workload, with the same DRAM and I/O configuration. Not too shabby, but we’ve seen even faster results (~3-4X faster) on workloads which are more memory bandwidth-intensive, like data warehousing or in-memory OLAP workloads.

Intel’s Dave Hill over at the The Server Room Blog writes:

As of March 30, 2009, Intel based 2 socket Xeon® 5500 series servers set at least 30 world performance records across a wide range of benchmarks that cover virtually every application type on the market. The performance results, just by themselves, are utterly amazing, and in general they are greater than 2x the Intel® Xeon® 5400 series processors (Harpertown).

There are numerous other benchmarks listed over at the Intel® Xeon® Processor Performance summary page. Check them out. You should be nothing less than amazed. It surely is a great time to be using commodity hardware and if you are not, perhaps you should be! And for those database vendors who are using proprietary hardware like FPGAs, well, I guess you are wishing that Intel’s Nehalem-EP processors are an April Fools’ joke, but you would be wrong.

Larry Ellison Mentions Exadata Performance Numbers

Yesterday Oracle Corporation had its earnings call for F3Q2009. On the call Larry and Charles mention a few of the Exadata performance numbers observed.

Larry Ellison:

…looking forward, I think the most exciting product we’ve had in many, many years is our Exadata Database Server…

Exadata is 100% innovation on top of our very large and very strong database business. And the early results have been remarkable. Charles Phillips will go into a lot of detail but I’ll just throw a couple of numbers out there.

One of our customers, and Charles will describe this customer, one of our customers saw a 28x performance improvement over an existing Oracle database. Another customer saw a monthly aggregation drop from 4.5 hours just to 3 minutes.

When compared to Teradata, a competitive database machine that’s been in the market for a very, very long time, another customer saw that we were 6x faster than their existing Teradata application, when using Exadata versus Teradata.

Another customer saw a batch process fall from 8 hours to 30 minutes. Charles will go into more detail on all this, he will repeat those numbers, because I think they’re worth mentioning twice.

Charles Phillips:

On databases, Larry mentioned, we’re very excited about how the HP Oracle database machine is performing. The increases have just been stunning and so we are getting great feedback from our customers and the pipeline is the largest build I’ve ever seen in terms of a new product.

And as he mentioned, the numbers are just stunning. The major European retailer who reduced the batch processing time from 8 hours to 30 minutes did not believe the process had completed. We had to convince him that’s actually how it’s done.

And so, as Larry mentioned, this is the reminder that this is an internally developed technology in the midst of all the discussion of acquisitions. People forget that we’re actually spending $3.0 billion a year on research and development and this is why we do it.

I agree with Larry 100%. Exadata is the most exciting product and from a database performance engineer perspective (and a hardware junkie) it is quite amazing to see a single 42U rack of HP Oracle Database Machine rip and tear through terabytes of data as if it were breaking the laws of physics. Quite exciting indeed.

The Impact Of Good Table And Query Design

There are many ways to design tables/schemas and many ways to write SQL queries that execute against those tables/schemas. Some designs are better than others for various reasons, however, I think that frequently people underestimate the power of SQL (for both “good” and “evil”). All too often in data warehouses, I see tables designed for one specific report, or a very select few reports. These tables frequently resemble Microsoft Excel Spreadsheets (generally Pivot Tables), not good Dimensional (Star Schema) or Third Normal Form (3NF) schema design. The problem with such designs is that it severely limits the usefulness of that data, as queries that were not known at the time of design often time become problematic. The following is a simple one table example, derived from a field experience in which I discuss two table designs and provide the SQL queries to answer a question the business is seeking.

The Business Question

First lets start with the business question for which the answer is being sought.
What customers meet the following criteria:

  • do not own PRODUCT1 or PRODUCT2 but have downloaded SOFTWARE
  • do not own PRODUCT2 and it has been more than 90 days between SOFTWARE download and their purchase of PRODUCT1

Version 1: The Column Based (Pivot) Table Design

For Version 1, there is a single row for each customer and each attribute has its own column. In this case there are 4 columns, each representing the most recent activity date for that product.

SQL> desc column_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 SOFTWARE_MAC_RECENCY_TS                            DATE
 SOFTWARE_WIN_RECENCY_TS                            DATE
 PRODUCT1_RECENCY_TS                                DATE
 PRODUCT2_RECENCY_TS                                DATE

SQL> select * from column_tab;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> select customer_id
  2  from   column_tab
  3  where  product2_recency_ts is null and
  4         (((software_win_recency_ts is not null or
  5            software_mac_recency_ts is not null) and
  6           product1_recency_ts is null) or
  7          ((software_win_recency_ts - product1_recency_ts) > 90 or
  8           (software_mac_recency_ts - product1_recency_ts) > 90));

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4293700422

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |     2 |    42 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COLUMN_TAB |     2 |    42 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("PRODUCT2_RECENCY_TS" IS NULL AND ("PRODUCT1_RECENCY_TS"
              IS NULL AND ("SOFTWARE_MAC_RECENCY_TS" IS NOT NULL OR
              "SOFTWARE_WIN_RECENCY_TS" IS NOT NULL) OR
              "SOFTWARE_MAC_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90 OR
              "SOFTWARE_WIN_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90))

As you can see, the query construct to answer the business question is straight forward and requires just one pass over the table.

Version 2: The Row Based (Unpivot) Table, Take 1

In Version 2, there is a single row (tuple) which tracks the customer, product and the recency date. Unlike Version 1, none of the columns can be NULL.

SQL> desc row_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 RECENCY_TS                                NOT NULL DATE
 PRODUCT                                   NOT NULL VARCHAR2(32)

SQL> select * from row_tab;

CUSTOMER_ID RECENCY_TS PRODUCT
----------- ---------- --------------------------------
        100 2009-03-17 SOFTWARE_MAC
        200 2009-03-17 SOFTWARE_MAC
        300 2009-03-17 SOFTWARE_MAC
        500 2009-03-17 SOFTWARE_MAC
        400 2009-03-17 SOFTWARE_WIN
        100 2008-11-17 PRODUCT1
        200 2009-01-16 PRODUCT1
        300 2008-10-08 PRODUCT1
        400 2008-11-07 PRODUCT1
        300 2009-02-25 PRODUCT2

10 rows selected.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id not in (select customer_id
  9                               from   row_tab
 10                               where  product in ('PRODUCT1', 'PRODUCT2')) and
 11         a.customer_id = b.customer_id
 12  union
 13  select a.customer_id
 14  from   row_tab a,
 15         (select customer_id,
 16                 product,
 17                 recency_ts
 18          from   row_tab
 19          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
 20  where  a.customer_id not in (select customer_id
 21                               from   row_tab
 22                               where  product = 'PRODUCT2') and
 23         a.customer_id = b.customer_id and
 24         (a.product = 'PRODUCT1' and
 25          b.recency_ts - a.recency_ts > 90);

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3517586312

---------------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |    11 |   368 |    22  (60)| 00:00:01 |
|   1 |  SORT UNIQUE          |         |    11 |   368 |    22  (60)| 00:00:01 |
|   2 |   UNION-ALL           |         |       |       |            |          |
|*  3 |    HASH JOIN ANTI     |         |    10 |   310 |    10  (10)| 00:00:01 |
|*  4 |     HASH JOIN         |         |    11 |   187 |     7  (15)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL| ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|   6 |      TABLE ACCESS FULL| ROW_TAB |    10 |    30 |     3   (0)| 00:00:01 |
|*  7 |     TABLE ACCESS FULL | ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|*  8 |    HASH JOIN ANTI     |         |     1 |    58 |    10  (10)| 00:00:01 |
|*  9 |     HASH JOIN         |         |     1 |    44 |     7  (15)| 00:00:01 |
|* 10 |      TABLE ACCESS FULL| ROW_TAB |     4 |    88 |     3   (0)| 00:00:01 |
|* 11 |      TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|* 12 |     TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   4 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   5 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   7 - filter("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2')
   8 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   9 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       filter("RECENCY_TS"-"A"."RECENCY_TS">90)
  10 - filter("A"."PRODUCT"='PRODUCT1')
  11 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
  12 - filter("PRODUCT"='PRODUCT2')

Version 2, Take 2

The way the query is written in Version 2, Take 1, it requires six accesses to the table. Partly this is because it uses a UNION. In this case the UNION can be removed and replaced with an OR branch.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id = b.customer_id and
  9         ((a.customer_id not in (select customer_id
 10                               from   row_tab
 11                               where  product in ('PRODUCT1', 'PRODUCT2')))
 12         or
 13         ((a.customer_id not in (select customer_id
 14                               from   row_tab
 15                               where  product = 'PRODUCT2') and
 16         (a.product = 'PRODUCT1' and
 17          b.recency_ts - a.recency_ts > 90))))
 18  /

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3327813549

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    44 |     7  (15)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|*  2 |   HASH JOIN         |         |    11 |   484 |     7  (15)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
|*  6 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2'))
              OR  NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B2) AND
              "A"."PRODUCT"='PRODUCT1' AND "RECENCY_TS"-"A"."RECENCY_TS">90)
   2 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   3 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   5 - filter("CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR
              "PRODUCT"='PRODUCT2'))
   6 - filter("PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B1)

This rewrite brings the table accesses down to four from six, so progress is being made, but I think we can do even better.

Version 2, Take 3

SQL is a very powerful language and there is usually more than one way to structure a query. Version 2, Take 1 uses a very literal translation of the business question and Take 2 just does a mild rewrite changing the UNION to an OR. In Version 2, Take 3, I am going to leverage some different, but very powerful functionality to yield the same results.

SQL> -- COLUMN_TAB can be expressed using ROW_TAB with MAX + CASE WHEN + GROUP BY:
SQL> select   customer_id,
  2           max (case
  3                   when product = 'SOFTWARE_MAC'
  4                      then recency_ts
  5                end) software_mac_recency_ts,
  6           max (case
  7                   when product = 'SOFTWARE_WIN'
  8                      then recency_ts
  9                end) software_win_recency_ts,
 10           max (case
 11                   when product = 'PRODUCT1'
 12                      then recency_ts
 13                end) product1_recency_ts,
 14           max (case
 15                   when product = 'PRODUCT2'
 16                      then recency_ts
 17                end) product2_recency_ts
 18  from     row_tab
 19  group by customer_id;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- The original query can be expressed as follows:
SQL> select customer_id
  2  from   (select   customer_id,
  3                   max (case
  4                           when product = 'SOFTWARE_MAC'
  5                              then recency_ts
  6                        end) software_mac_recency_ts,
  7                   max (case
  8                           when product = 'SOFTWARE_WIN'
  9                              then recency_ts
 10                        end) software_win_recency_ts,
 11                   max (case
 12                           when product = 'PRODUCT1'
 13                              then recency_ts
 14                        end) product1_recency_ts,
 15                   max (case
 16                           when product = 'PRODUCT2'
 17                              then recency_ts
 18                        end) product2_recency_ts
 19          from     row_tab
 20          group by customer_id)
 21  where  product2_recency_ts is null and
 22         (((software_win_recency_ts is not null or
 23            software_mac_recency_ts is not null) and
 24           product1_recency_ts is null) or
 25          ((software_win_recency_ts - product1_recency_ts) > 90 or
 26           (software_mac_recency_ts - product1_recency_ts) > 90)
 27         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 825621652

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|   2 |   HASH GROUP BY     |         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE "PRODUCT" WHEN 'PRODUCT2' THEN "RECENCY_TS" END
              ) IS NULL AND ((MAX(CASE "PRODUCT" WHEN 'SOFTWARE_WIN' THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE "PRODUCT" WHEN
              'SOFTWARE_MAC' THEN "RECENCY_TS" END ) IS NOT NULL) AND MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END ) IS NULL OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_WIN' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90 OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_MAC' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90))

Rewriting the query as a CASE WHEN with a GROUP BY not only cleaned up the SQL, it also resulted in a single pass over the table. Version 2, Take 3 reduces the table access from four to one!

Version 2, Take 4: The PIVOT operator in 11g

In 11g the PIVOT operator was introduced and can simplify the query even more.

SQL> -- In 11g the PIVOT operator can be used, so COLUMN_TAB can be expressed as:
SQL> select *
  2  from row_tab
  3  pivot (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts));

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- Using PIVOT the original query can be expressed as:
SQL> select customer_id
  2  from   row_tab
  3  pivot  (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts))
  8  where  product2_recency_ts is null and
  9         (((software_win_recency_ts is not null or
 10            software_mac_recency_ts is not null) and
 11           product1_recency_ts is null) or
 12          ((software_win_recency_ts - product1_recency_ts) > 90 or
 13           (software_mac_recency_ts - product1_recency_ts) > 90)
 14         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3127820873

--------------------------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER              |         |       |       |            |          |
|   2 |   HASH GROUP BY PIVOT|         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL | ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE  WHEN ("PRODUCT"='PRODUCT2') THEN "RECENCY_TS"
              END ) IS NULL AND ((MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE  WHEN
              ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END ) IS NOT NULL) AND
              MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END ) IS NULL
              OR MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90 OR
              MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90))

The Big Picture

One thing that I did not touch on is the flexibility of the ROW_TAB design when it comes to evolution. Any number of products can be added without making any modifications to the loading process. In order to do this with the COLUMN_TAB a new column must be added for each new product. The other major difference between the two table designs is that ROW_TAB is insert only while COLUMN_TAB must be updated if the customer exists. Generally one wants to avoid updated in a data warehouse as 1) old data is usually over written and 2) updates are more expensive than inserts.

The other major thing I won’t discuss in detail is how to partition or index (if required) COLUMN_TAB. Think about this. With ROW_TAB it is very straight forward.

Summary

There are many ways to design tables and write queries. Some of them work well, some do not. Some appear impossible at first, only to appear more simple later. Literal translation of a business question into SQL is usually far from optimal. One needs to think about the question being asked, the shape of the data, and the options available to solve that problem as well as the trade offs of those solutions. Remember: table definitions do not have to look like Spreadsheets. Generally only the output of a query needs to.

Don’t get stuck in SQL-92. It is the year 2009. You should be writing your SQL using the constructs that are provided. Often times very complex data transformations can be done with just SQL. Leverage this power.

All experiments performed on 11.1.0.7

Happy Anniversary To The Structured Data Blog

February 23, 2009 marks the 2nd anniversary of the Structured Data blog. Looking back, I guess I forgot to have a first anniversary, but oh well, I guess I’ll celebrate twice as much this year to make up for it. The past two years have been fun and I’ve enjoyed sharing my experiences and interacting with the readers of this blog. I do wish I had a bit more time to commit to writing but I guess it is what it is. Being a bit of a perfectionist, I’d rather focus on quality than quantity. The past year has been a fun, but busy one, especially with the launch of Oracle Exadata and the HP Oracle Database Machine. I think the rest of 2009 will continue to be a busy and exciting one and I hope to have several more Exadata posts.

The Past Year

Looking back over the past year the top visited blog posts have been:

  1. Choosing An Optimal Stats Gathering Strategy
  2. Troubleshooting Bad Execution Plans
  3. Top Ways How Not To Scale Your Data Warehouse

I’m glad to see the first two of those have been frequently visited because I think they cover topics that are extremely helpful in dealing with the root causes of bad execution plans, a problem that many DBAs face. Contrary to what some “experts” would like to have you believe, there are no silver bullets or magic involved with query plan tuning. It all comes down to understanding how things work and I hope that those posts give you insight into exactly that.

I’ve been meaning to circle back and expand more on #3 ever since I read this comment from Tom Kyte’s blog. Hopefully I will finish up that up in the next few weeks.

The Next Year

One of the main reasons that I started this blog was to share my work experiences with the Oracle community in hopes to explain and help DBAs use the Oracle database software optimally as well as to assist in troubleshooting common, but often difficult issues. In conjunction with my modus operandi, I’d like to offer up the Topic Suggestions page. Feel free to use this page to suggest topics that you would like to have further explained or would like discussion of.

Thanks to all of you who have read and contributed to this blog and may the next year be a great one.

Managing Optimizer Statistics Paper

Over the past couple days I’ve been reading through the recent paper by Karen Morton entitled “Managing Statistics for Optimal Query Performance“. In this paper Karen goes over many of the topics I have discussed as well (and a few that I have not) in the following blog posts:

Overall I think Karen does a good job discussing the key issues related to object statistics and has examples to assist in understanding where and why issues can arise. I’m also a fan of works that do not project an answer, but rather provide information, how to use that information, use working examples, and ultimately leaves the reader with the task of how to best apply the new found knowledge to their environment. I would recommend to all to give it a read.

Comments On Dynamic Sampling

In section 5.1 on page 22 Karen discusses using dynamic sampling, of particular interest she has a table of statistics from Robyn Sands that compares two different runs of some application jobs both using optimizer_dynamic_sampling=4. One uses stats collected with a 100% sample. The other uses no static stats at all, only using dynamic sampling. The data that Robyn has captured looks interesting, but I’m not exactly sure how to interpret all of it (there are a lot of numbers in that chart) and only two sentences that speak to the data. I think I understand what message behind chart but I think it would be easier to see the median along with the average so that one could clearly see there is a large range of values, thus indicating there are numbers far away from the median indicating some vast difference. This type of comparison (before change/after change) is exactly what the Oracle Database 11g SQL Performance Analyzer (SPA) was designed for. One of the reasons I think SQL Performance Analyzer is a great tool for controlled execution and comparison of SQL statements is that not only collects all the execution metrics, but also it captures the execution plan. Looking at the data from Robyn leaves me with the question: For the SQL statements executed, which ones had a plan change, and was that plan change for the better or for worse? The last comment I would make is that while relying 100% on dynamic sampling (having no object statistics) is an interesting data point, I would not recommend this for a production database. Dynamic sampling was designed to augment static statistics and to provide some statistics to the query optimizer in the cases where none exist and only guesses (or selectivity constants) would otherwise be used. The main logic behind this recommendation is that dynamic sampling does not provide all metadata that static statistics provide, it only provides a subset. While this subset of metadata may be enough to get the equivalent plan at times, frequently it is not.

Remember The Chain of Events

Karen mentions (page 5) something that was in the Oracle OpenWorld 2008 session “Real-World Database Performance Techniques and Methods” (slide 29):

Slide29.jpg

Having a good understanding of this chain of events should help you in getting to the root cause of poor execution plans. The key takeaway is to recognize that without representative statistics, the optimizer has very little chance to choose the best execution plan, so give it the best chance you can.

Additional Resources

If you have not looked through the slides from “Real-World Database Performance Techniques and Methods” I might suggest you do. There is some great information based on the Real-World Performance Group’s experience that centers around this topic. Especially note the sections:

  • Optimizer Exposé (slide 7)
  • Managing Statistics on Partitioned Tables (slide 34)

Hopefully you find the information in those slides useful.

A New Look For Structured Data

I’ve decided it was time for a new look for Structured Data so I’ve updated the theme. I think I have looked at enough posts and all looks well but if you find something that appears odd, unreadable or otherwise, do let me know by posting a comment on this thread and include the link to the problematic post.

One of the main reasons I chose this theme was for the java script that allows links back to a comment that one has replied to as well as the option to quote previous comments as well. On the down side the Trackbacks are hidden by default and one has to choose either Trackbacks or Comments to be viewed. I may modify this as time permits. Time came sooner than I had originally anticipated.

Feel free to leave a comment with your thoughts, especially if you think this version of the site lacks something the previous version had.

The Best Benchmarketing I've Seen Yet: Measure BI Queries In Milliseconds

After posting about how ridiculous some of the benchmarketing claims that database vendors are making, Dave Menninger, VP of Marketing & Product Management at Vertica posted a comment that one of their customers reported a 40,400x gain in one query (this of course is after I openly joked about the 16,200x Vertica claim). So I made my way over to check out this claim, and sure enough, someone reported this. Here is the table presented in the webcast:

hMetrix_Vertica.png

To this database performance engineer, this yet another unimpressive performance claim, but rather a very creative use of numbers, or maybe better put, a good case of bad math. Or better yet, big fun with small numbers. Honestly, measuring a BI query response time in milliseconds?!?! I don’t even know if OLTP database users measure their query response time in milliseconds. I simply can’t stop laughing at the fact that there needs to be precision below 1 second. Obviously BI users could not possibly tell that their query ran in less than 1 second because the network latency would mask this. Not only that, it seems there were 154 queries to choose from and the Vertica marketing crew chose to mention this one. Brilliant I say. So yes Dave, this is even more ludicrous than the 16,200x claim. At best it is a 202x gain. You won’t get credit from me (and probably others) for fractional seconds, but thanks for mentioning it. It was a good chuckle. By the way, why add two extra places of precision for this query and not all the others?

I think it is also worth mentioning that the data set size for this case is 84GB (raw) and 10.5GB in the Vertica DB (8x compression). Given the server running the database has 32GB of RAM it easily classifies as an in-memory database, so response time should certainly be in the seconds. I don’t know about you, but performance claims on a database in which the uncompressed data fits on an Apple iPod don’t excite me.

Dave Menninger also mentions:

One other piece of information in an effort of full (or at least more) disclosure is the following blog post that breaks down the orders of magnitude differences between row stores and column stores to their constituent parts.
Debunking Yet Another Myth: Column-Stores As A Storage-Layer Only Optimization

Column stores have been a topic of many research papers. The one that has caught my attention most recently is the paper by Allison Holloway and David DeWitt (Go Badgers!) entitled Read-Optimized Databases, In Depth and the VLDB 2008 presentation which has an alternate title of Yet Another Row Store vs Column Store Paper. I might suggest that you give them a read. Perhaps the crew at The Database Column will offer some comments on Allison and David’s research. I’m surprised that they haven’t already.

Well, that’s enough fun for a Friday. Time to kick off some benchmark queries on my HP Oracle Database Machine.

Database Customer Benchmarketing Reports

A few weeks ago I read Curt Monash’s report on interpreting the results of data warehouse proofs-of-concept (POCs) and I have to say, I’m quite surprised that this topic hasn’t been covered more by analysts in the data warehousing space. I understand that analysts are not database performance engineers, but where do they think that the performance claims of 10x to 100x or more come from? Do they actually investigate these claims or just report on them? I can not say that I have ever seen any database analyst offer any technical insight into these boasts of performance. If some exist be sure to leave a comment and point me to them.

Oracle Exadata Performance Architect Kevin Closson has blogged about a 485x performance increase of Oracle Exadata vs. Oracle Exadata and his follow-up post to explain exactly where the 485x performance gain comes from gave me the nudge to finish this post that had been sitting in my drafts folder since I first read Curt’s post.

Customer Bechmarketing Claims

I thought I would compile a list of what the marketing folks at other database vendors are saying about the performance of their products. Each of these statements have been taken from the given vendor’s website.

  • Netezza: 10-100 times faster than traditional solutions…but it is not uncommon to see performance differences as large as 200x to even 400x or more when compared to existing Oracle systems
  • Greenplum: often 10 to 100 times faster than traditional solutions
  • DATAllegro: 10-100x performance over traditional platforms
  • Vertica: Performs 30x-200x faster than other solutions
  • ParAccel: 20X – 200X performance gains
  • EXASolution: can perform up to 100 times faster than with traditional databases
  • Kognitio WX2: Tests have shown to out-perform other database / data warehouse solutions by 10-60 times

Certainly seems these vendors are a positioning themselves against traditional database solutions, whatever that means. And differences as large as 400x against Oracle? What is it exactly they are comparing?

Investigative Research On Netezza’s Performance Claims

Using my favorite Internet search engine I came across this presentation by Netezza dated October 2007. On slide 21 Netezza is comparing an NPS 8150 (112 SPU, up to 4.5 TB of user data) server to IBM DB2 UDB on a p680 with 12 CPUs (the existing solution). Not being extremely familiar with the IBM hardware mentioned, I thought I’d research to see exactly what an IBM p680 server consists of. The first link in my search results took me to here where the web page states:

The IBM eServer pSeries 680 has been withdrawn from the market, effective March 28, 2003.

Searching a bit more I came across this page which states that the 12 CPUs in the pSeries 680 are RS64 IV microprocessors. According to Wikipedia the “RS64-IV or Sstar was introduced in 2000 at 600 MHz, later increased to 750 MHz”. Given that at best, the p680 had 12 CPUs running at 750 MHz and the NPS 8150 had 112 440GX PowerPC processors I would give the compute advantage to Netezza by a significant margin. I guess it is cool to brag how your most current hardware beat up on some old used and abused server who has already been served its end-of-life notice. I found it especially intriguing that Netezza is boasting about beating out an IBM p680 server that has been end-of-lifed more than four years prior to the presentation’s date. Perhaps they don’t have any more recent bragging to do?

Going back one slide to #20 you will notice a comparison of Netezza and Oracle. Netezza clearly states they used a NPS 8250 (224 SPUs, up to 9 TB of user data) against Oracle 10g RAC running on Sun/EMC. Well ok…Sun/EMC what??? Obviously there were at least 2 Sun servers, since Oracle 10g RAC is involved, but they don’t mention the server models at all, nor the storage, nor the storage connectivity to the hosts. Was this two or more Sun Netra X1s or what??? Netezza boasts a 449x improvement in a “direct comparison on one day’s worth of data”. What exactly is being compared is up to the imagination. I guess this could be one query or many queries, but the marketeers intentionally fail to mention. They don’t even mention the data set size being compared. Given that Netezza can read data off the 224 drives at 60-70 MB/s, the NPS 8250 has a total scan rate of over 13 GB/s. I can tell you first hand that there are very few Sun/EMC solutions that are configured to support 13 GB/s of I/O bandwidth. Most configurations of that vintage probably don’t support 1/10th of that I/O bandwidth (1.3 GB/s).

Here are a few more comparisons that I have seen in Netezza presentations:

  • NPS 8100 (112 SPUs/4.5 TB max) vs. SAS on Sun E5500/6 CPUs/6GB RAM
  • NPS 8100 (112 SPUs/4.5 TB max) vs. Oracle 8i on Sun E6500/12 CPUs/8 GB RAM
  • NPS 8400 (448 SPUs/18 TB max) vs. Oracle on Sun (exact hardware not mentioned)
  • NPS 8100 (112 SPUs/4.5 TB max) vs. IBM SP2 (database not mentioned)
  • NPS 8150z (112 SPUs/5.5 TB max) vs. Oracle 9i on Sun/8 CPUs
  • NPS 8250z (224 SPUs/11 TB max) vs. Oracle 9i on Sun/8 CPUs

As you can see, Netezza has a way of finding the oldest hardware around and then comparing it to its latest, greatest NPS. Just like Netezza slogan, [The Power to ]Question Everything™, I suggest you question these benchmarketing reports. Database software is only as capable as the hardware it runs on and when Netezza targets the worst performing and oldest systems out there, they are bound to get some good marketing numbers. If they compete against the latest, greatest database software running on the latest, greatest hardware, sized competitively for the NPS being used, the results are drastically different. I can vouch for that one first hand having done several POCs against Netezza.

One Benchmarketing Claim To Rule Them All

Now, one of my favorite benchmarketing reports is one from Vertica. Michael Stonebraker’s blog post on customer benchmarks contains the following table:

vertica_benchmark_table.png

Take a good look at the Query 2 results. Vertica takes a query running in the current row store from running in 4.5 hours (16,200 seconds) to 1 second for a performance gain of 16,200x. Great googly moogly batman, that is reaching ludicrous speed. Heck, who needs 100x or 400x when you do 16,200x. That surely warrants an explanation of the techniques involved there. It’s much, much more than simply column store vs. row store. It does raise the question (at least to me): why Vertica doesn’t run every query in 1 second. I mean, come on, why doesn’t that 19 minute row store query score better than a 30x gain? Obviously there is a bit of the magic pixie dust going on here with, what I would refer to as “creative solutions” (in reality it is likely just a very well designed projection/materaizied view, but by showing the query and telling us how it was possible would make it less unimpressive [sic]).

What Is Really Going On Here

First of all, you will notice that not one of these benchmarketing claims is against a vendor run system. Each and every one of these claims are against existing customer systems. The main reason for this is that most vendors prohibit benchmark results being published with out prior consent from the vendor in the licensing agreement. Seems the creative types have found that taking the numbers from the existing, production system is not prohibited in the license agreement so they compare that to their latest, greatest hardware/software and execute or supervise the execution of a benchmark on their solution. Obviously this is a one sided apples to bicycles comparison, but quite favorable for bragging rights for the new guy.

I’ve been doing customer benchmarks and proof of concepts (POCs) for almost 5 years at Oracle. I can guarantee you that Netezza has never even come close to getting 10x-100x the performance over Oracle running on a competitive hardware platform. Now I can say that it is not uncommon for Oracle running on a balanced system to perform 10x to 1000x (ok, in extreme cases) over an existing poorly performing Oracle system. All it takes is to have a very unbalanced system with no I/O bandwidth, not be using parallel query, not use compression, poor or no use of partitioning and you have created a springboard for any vendor to look good.

One More Juicy Marketing Tidbit

While searching the Internet for creative marketing reports I have to admit that the crew at ParAccel probably takes the cake (and not in an impressive way). On one of their web pages they have these bullet points (plus a few more uninteresting ones):

  • All operations are done in parallel (A non-parallel DBMS must scan all of the data sequentially)
  • Adaptive compression makes disks faster…

Ok, so I can kinda, sorta see the point that a non-parallel DBMS must do something sequentially…not sure how else it would do it, but then again, I don’t know any enterprise database that is not capable of parallel operations. However, I’m going to need a bit of help on the second point there…how exactly does compression make disks faster? Disks are disks. Whether or not compression is involved has nothing to do with how fast a disk is. Perhaps they mean that compression can increase the logical read rate from a disk given that compression allows more data to be stored in the same “space” on the disk, but that clearly is not what they have written. Reminds me of DATAllegro’s faster-than-wirespeed claims on scan performance. Perhaps these marketing guys should have their numbers and wording validated by some engineers.

Do You Believe In Magic Or Word Games?

Creditable performance claims need to be accounted for and explained. Neil Raden from Hired Brains Research offers guidance for evaluating benchmarks and interpreting market messaging in his paper, Questions to Ask a Data Warehouse Appliance Vendor. I think Neil shares the same opinion of these silly benchmarketing claims. Give his paper a read.

Glenn Fawcetts's Oracle Analysis 101

Glenn Fawcett has put together a nice slide deck entitled “Oracle Analysis 101” that discusses techniques to collect and analyze performance data related to Oracle databases. It’s a good read so check it out. Glenn is also a great resource for Oracle on Sun Niagara 2 Processors (CMT) so be sure to peek at his other blog posts as well.

Preprocessor For External Tables

Before External Tables existed in the Oracle database, loading from flat files was done via SQL*Loader. One option that some used was to have a compressed text file and load it with SQL*Loader via a named pipe. This allowed one not to have to extract the file, which could be several times the size of the compressed file. As of 11.1.0.7, a similar feature is now available for External Tables (and will be in 10.2.0.5). This enhancement is a result of Bug 6522622 which is mentioned in the Bugs fixed in the 11.1.0.7 Patch Set note. Unfortunately it appears that there aren’t any notes on how to actually use the External Table Preprocessor so allow me to give some insight into its use.

The PREPROCESSOR clause is part of the record_format_info clause. The syntax of the PREPROCESSOR clause is as follows:

PREPROCESSOR [directory_spec:] file_spec [preproc_options_spec]

It’s pretty straight forward when you see an example. Line 31 contains the new clause.

create or replace directory load_dir as '/data/tpc-ds/flat_files/1gb';
create or replace directory log_dir  as '/tmp';
create or replace directory exec_dir as '/bin';
--
-- ET_CUSTOMER_ADDRESS
--
DROP TABLE ET_CUSTOMER_ADDRESS;
CREATE TABLE ET_CUSTOMER_ADDRESS
(
    "CA_ADDRESS_SK"                  NUMBER
   ,"CA_ADDRESS_ID"                  CHAR(16)
   ,"CA_STREET_NUMBER"               CHAR(10)
   ,"CA_STREET_NAME"                 VARCHAR2(60)
   ,"CA_STREET_TYPE"                 CHAR(15)
   ,"CA_SUITE_NUMBER"                CHAR(10)
   ,"CA_CITY"                        VARCHAR2(60)
   ,"CA_COUNTY"                      VARCHAR2(30)
   ,"CA_STATE"                       CHAR(2)
   ,"CA_ZIP"                         CHAR(10)
   ,"CA_COUNTRY"                     VARCHAR2(20)
   ,"CA_GMT_OFFSET"                  NUMBER
   ,"CA_LOCATION_TYPE"               CHAR(20)
)
ORGANIZATION EXTERNAL
(
   TYPE oracle_loader
   DEFAULT DIRECTORY load_dir
   ACCESS PARAMETERS
   (
      RECORDS DELIMITED BY NEWLINE
      PREPROCESSOR exec_dir:'gunzip' OPTIONS '-c'
      BADFILE log_dir: 'CUSTOMER_ADDRESS.bad'
      LOGFILE log_dir: 'CUSTOMER_ADDRESS.log'
      FIELDS TERMINATED BY '|'
      MISSING FIELD VALUES ARE NULL
      (
          "CA_ADDRESS_SK"
         ,"CA_ADDRESS_ID"
         ,"CA_STREET_NUMBER"
         ,"CA_STREET_NAME"
         ,"CA_STREET_TYPE"
         ,"CA_SUITE_NUMBER"
         ,"CA_CITY"
         ,"CA_COUNTY"
         ,"CA_STATE"
         ,"CA_ZIP"
         ,"CA_COUNTRY"
         ,"CA_GMT_OFFSET"
         ,"CA_LOCATION_TYPE"
      )
   )
   LOCATION ('customer_address.dat.gz')
)
REJECT LIMIT UNLIMITED
;

SQL> select count(*) from ET_CUSTOMER_ADDRESS;

  COUNT(*)
----------
     50000

Now let’s double check:

$ gunzip -c customer_address.dat.gz | wc -l
50000

Note: The preprocessor option does not allow the use│, <, >, &, and $ characters due to security reasons.

This is a great enhancement for those who transport compressed files around their networks and want to load them directly into their database via External Tables. One advantage of this feature is that when loading flat files from an NFS staging area, the network traffic is reduced by N, where N is the compression ratio of the file. For example, if your flat file compresses 10x (which is not uncommon), then you get an effective gain of 10x the throughput for the same network bandwidth. Or if you like, the required network bandwidth is reduced 10x to transfer the same logical data set. In this case the compression rate was 4x.

There are a few things to be aware of when using this feature. If the external table is parallel, then the number of files in the External Table Location clause should be equal or greater than the degree of parallelism (DOP). This is because the preprocessor outputs a stream and this stream can not be broken down into granules for multiple Parallel Query Slaves to work on, like a normal uncompressed text file. Each PQ Slave can work on at most, 1 file/stream. For example, if you have a DOP of 16 set on the External Table, but only have 10 files, 10 PQ Slaves will be busy and 6 will be idle, as there are more slaves than files. This means that to get optimal throughput the number of files should be a multiple of the DOP. Obviously this is not always possible so the recommendation is to have more smaller files vs. fewer larger files. This will limit the skew in the workload for the PQ Slaves if/when there are “remainder” files.

Hopefully you will find this enhancement very useful. I sure do.

Oracle Exadata: In Response to Chuck Hollis

Chuck Hollis, VP and Global Marketing CTO at EMC has written a couple blog posts offering his thoughts on Oracle Exadata. The first was “Oracle Does Hardware” which he wrote the day after the product launch. The second, unimpressively titled “I Annoy Kevin Closson at Oracle” was on Monday October 20th which was in response to a blog post by Exadata Performance Architect, Kevin Closson who commented on Chuck’s first post and some comments left on Kevin’s blog.

Clearly Stated Intentions

Since Chuck had disabled comments for his “I Annoy Kevin” post, I’m going to write my comments here. I have no intention to get into some fact-less debate turn flame, but I will make some direct comments with supporting facts and numbers while keeping it professional.

Storage Arrays: Bottleneck or Not?

Chuck thinks:

“…array-based storage technology is not the bottleneck; our work with Oracle [on the Oracle Optimized Warehouse Initiative] and other DW/BI environments routinely shows that we can feed data to a server just as fast as it can take it.”

First let me comment on the Optimized Warehouse Initiative. There have been some good things that have come out of this effort. I believe it has increased the level of awareness when it comes to sizing storage for BI/DW workloads. All too often storage sizing for BI/DW is done by capacity, not I/O bandwidth. The focus is on building balanced systems: systems that can execute queries and workloads such that no one component (CPU/storage connectivity/disk array/disk drives) becomes the bottleneck prematurely. The industry seems to agree: IBM has the Balanced Warehouse and Microsoft has a reference architecture for Project Madison as well.

So the question comes back to: Is array-based storage technology the bottleneck or not? I would argue it is. Perhaps I would use a word other than “bottleneck”, but let’s be clear on the overall challenge here. That is: to read data off disk with speed and efficiently return it to the database host to process it as fast as possible.

Let’s start at the bottom of the stack: hard disk drives. If the challenge is to scan lots of data fast, then how fast data can be read off disk is the first important metric to consider. In the white paper Deploying EMC CLARiiON CX4-960 for Data Warehouse/Decision Support System (DSS) Workloads EMC reports a drive scan rate (for a BI/DW workload) of 20 MB/s using 8+1 RAID-5 and 33 MB/s using a 2+1 RAID-5 LUN configuration. Oracle Exadata delivers drive scan rates around 85 MB/s, a difference of 2.5X to 4.25X. To understand the performance impact of this I’ve put together a few tables of data based on these real workload numbers.

Hardware Specs and Numbers for Data Warehouse Workloads

Storage RAID Raw:Usable Ratio Disk Drives Disk Scan Rate
EMC CX4-960
8+1 RAID 5
9:8
146 GB FC 15k RPM
20 MB/s
EMC CX4-960
2+1 RAID 5
3:2
146 GB FC 15k RPM
33 MB/s
EMC CX4-960
8+1 RAID 5
9:8
300 GB FC 15k RPM
20 MB/s
EMC CX4-960
2+1 RAID 5
3:2
300 GB FC 15k RPM
33 MB/s
Oracle Exadata
ASM Mirroring
2:1
450 GB SAS 15k RPM
85 MB/s

Sizing By Capacity

Storage RAID Total Usable Space Disk Drive Number of Drives Total Scan Rate
EMC CX4-960
8+1 RAID 5
18 TB
146 GB
139
2.8 GB/s
EMC CX4-960
2+1 RAID 5
18 TB
146 GB
185
6.1 GB/s*
EMC CX4-960
8+1 RAID 5
18 TB
300 GB
68
1.4 GB/s
EMC CX4-960
2+1 RAID 5
18 TB
300 GB
90
3.0 GB/s
Oracle Exadata
ASM Mirroring
18 TB
450 GB
80
6.8 GB/s

* I’m not sure that the CX4-960 array head is capable of 6.1 GB/s so it likley takes at least 2 CX4-960 array heads to deliver this throughput to the host(s).

Sizing By Scan Rate

Storage RAID Total Scan Rate Disk Drive Number of Drives Total Usable Space
EMC CX4-960
8+1 RAID 5
3.00 GB/s
146 GB
150
19.46 TB
EMC CX4-960
2+1 RAID 5
3.00 GB/s
146 GB
90
8.76 TB
EMC CX4-960
8+1 RAID 5
3.00 GB/s
300 GB
150
40.00 TB
EMC CX4-960
2+1 RAID 5
3.00 GB/s
300 GB
90
18.00 TB
Oracle Exadata
ASM Mirroring
3.00 GB/s
450 GB
36
8.10 TB

A Few Comments On The Above Data Points

Please note that “Total Usable Space” is a rough number for the total protected disk space one can use for a database if you filled each drive up to capacity. It does not take into consideration things like loss for formatting, space for sort/temp, etc, etc. I would use a 60% rule for estimating data space for database vs. total usable space. This means that 18 TB of total usable space would equate to 10 TB (max) of space for database data (compression not accounted for).

I’d also like to note that in the Sizing By Capacity table the “Total Scan Rate” is a disk only calculation. Whether or not a single CX4-960 array head can move data at that rate is in question. Based on the numbers in the EMC whitepaper it would appear CX4-960 head is capable of 3 GB/s but I would question if it is capable of much more than that, hence the reason for the asterisk(*).

Looking At The Numbers

If you look at the number for Sizing By Capacity, you can see that for the given fixed size, Exadata provides the fastest scan rate while using only 80 disk drives. The next closest scan rate is just 700 MB/s less but it uses 105 more disk drives (80 vs. 185). Quite a big difference.

When it comes to delivering I/O bandwidth, Exadata clearly stands out. Targeting a scan rate of 3 GB/s, Exadata delivers this using only 36 drives, just 3 Exadata Storage Servers. If one wanted to deliver this scan rate with the CX4 it would take 2.5X as many drives (90 vs. 36) using 2+1 RAID 5.

So are storage arrays the bottleneck? You can draw your own conclusions, but I think the numbers speak to the performance advantage with Oracle Exadata when it comes to delivering I/O bandwidth and fast scan rates. Consider this: What would the storage topology look like if you wanted to deliver a scan rate of 74 GB/s as we did for Oracle OpenWorld with 84 HP Oracle Exadata Storage Servers (6 HP Oracle Database Machines)? Honestly I would struggle to think where I would put the 185 or so 4Gb HBAs to achieve that.

Space Saving RAID or Wasteful Mirroring

This leads me to another comment by Chuck in his second post:

“[with Exadata] The disk is mirrored, no support of any space-saving RAID options — strange, for such a large machine”

And this one in his first post:

“If it were me, I’d want a RAID 5 (or 6) option.”

And his comment on Kevin’s blog:

“The fixed ratio of 12 disks (6 usable) per server element strikes us as a bit wasteful….And, I know this only matters to storage people, but there’s the minor matter of having two copies of everything, rather than the more efficient parity RAID approaches. Gets your attention when you’re talking 10-40TB usable, it does.”

Currently Exadata uses ASM mirroring for fault tolerance so there is a 2:1 ratio of raw disk to usable disk, however I don’t think it matters much. The logic behind that comment is that when one is sizing for a given scan rate, Exadata uses less spindles than the other configurations even though the disk protection is mirroring and not space-saving RAID 5. I guess I think it is strange to worry about space savings when disks just keep getting bigger and many are keeping the same performance characteristics as their predecessors. Space is cheap. Spindles are expensive. When one builds a configuration that satisfies the I/O scan rate requirement, chances are you have well exceeded the storage capacity requirement, even when using mirroring.

Perhaps Chuck likes space-saving RAID 5, but I think using less drives (0.4 as many, 36 vs. 90) to deliver the same scan rate is hardly wasteful. You know what really gets my attention? Having 40 TB of total usable space on 15 HP Oracle Exadata Storage Servers (180 450GB SAS drives) and being able to scan it at 15 GB/s compared to say having a CX4 with 200 drives @ 300GB using 2+1 R5 and only being able to scan them at 6.6 GB/s. I’d also be willing to bet that would require at least 2 if not 3 CX4-960 array heads and at least 30 4Gb HBAs running at wire speed (400 MB/s).

Exadata Is Smart Storage

Chuck comments:

“Leaving hardware issues aside, how much of the software functionality shown here is available on generic servers, operating systems and storage that Oracle supports today? I was under the impression that most of this great stuff was native to Oracle products, and not a function of specific tin …

If the Exadata product has unique and/or specialized Oracle logic, well, that’s a different case.”

After reading that I would said Chuck has not read the Technical Overview of the HP Oracle Exadata Storage Server. Not only does Exadata have a very fast scan rate, it has intelligence. A combination of brawn and brains which is not available with other storage platforms. The Oracle Exadata Storage Server Software (say that 5 times fast!!!) is not an Oracle database. It is storage software not database software. The intelligence and specialized logic is that Exadata Smart Scans return only the relevant rows and columns of a query, allowing for better use of I/O bandwidth and increased database performance because the database host(s) are not issuing I/O requests for data that is not needed for the query and then processing it post-fact. There are a couple slides (18 & 19) referencing a simple example of the benifits of Smart Scans in the HP Oracle Exadata Storage Server technical overview slide deck. It is worth the read.

It Will Be Interesting Indeed

Chuck concludes his second post with:

“The real focus here should be software, not hardware.”

Personally I think the focus should be on solutions that perform and scale and I think the HP Oracle Exadata Storage Server is a great solution for Oracle data warehouses that require large amounts of I/O bandwidth.

Ending On A Good Note

While many comments by Chuck do not seem to be well researched I would comment that having a conventional mid-range storage array that can deliver 3 GB/s is not a bad thing at all. I’ve seen many Oracle customers that have only a fraction of that and there are probably some small data warehouses out there that may run fine with 3 GB/s of I/O bandwidth. However, I think that those would run even faster with Oracle Exadata and I’ve never had a customer complain about queries running too fast.