Category: Oracle

Oracle Parallel Execution: Interconnect Myths And Misunderstandings

A number of weeks back I had come across a paper/presentation by Riyaj Shamsudeen entitled Battle of the Nodes: RAC Performance Myths (avaiable here). As I was looking through it I saw one example that struck me as very odd (Myth #3 – Interconnect Performance) and I contacted him about it. After further review Riyaj commented that he had made a mistake in his analysis and offered up a new example. I thought I’d take the time to discuss this as parallel execution seems to be one of those areas where many misconceptions and misunderstandings exist.

The Original Example

I thought I’d quickly discuss why I questioned the initial example. The original query Riyaj cited is this one:

select /*+ full(tl) parallel (tl,4) */
       avg (n1),
       max (n1),
       avg (n2),
       max (n2),
       max (v1)
from   t_large tl;

As you can see this is a very simple single table aggregation without a group by. The reason that I questioned the validity of this example in the context of interconnect performance is that the parallel execution servers (parallel query slaves) will each return exactly one row from the aggregation and then send that single row to the query coordinator (QC) which will then perform the final aggregation. Given that, it would seem impossible that this query could cause any interconnect issues at all.

Riyaj’s Test Case #1

Recognizing the original example was somehow flawed, Riyaj came up with a new example (I’ll reference as TC#1) which consisted of the following query:

select /*+ parallel (t1, 8,2) parallel (t2, 8, 2)  */
       min (t1.customer_trx_line_id + t2.customer_trx_line_id),
       max (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.quantity_ordered + t2.quantity_ordered),
       max (t1.attribute_category),
       max (t2.attribute1),
       max (t1.attribute2)
from   (select *
        from   big_table
        where  rownum <= 100000000) t1,
       (select *
        from   big_table
        where  rownum <= 100000000) t2
where  t1.customer_trx_line_id = t2.customer_trx_line_id;

The execution plan for this query is:

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |     1 |   249 |       |  2846K  (4)| 01:59:01 |        |      |            |
|   1 |  SORT AGGREGATE           |           |     1 |   249 |       |            |          |        |      |            |
|*  2 |   HASH JOIN               |           |   100M|    23G|   762M|  2846K  (4)| 01:59:01 |        |      |            |
|   3 |    VIEW                   |           |   100M|    10G|       |  1214K  (5)| 00:50:46 |        |      |            |
|*  4 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|   5 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|   6 |       PX SEND QC (RANDOM) | :TQ10000  |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | P->S | QC (RAND)  |
|*  7 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q1,00 | PCWC |            |
|   8 |         PX BLOCK ITERATOR |           |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| BIG_TABLE |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWP |            |
|  10 |    VIEW                   |           |   100M|    12G|       |  1214K  (5)| 00:50:46 |        |      |            |
|* 11 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|  12 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|  13 |       PX SEND QC (RANDOM) | :TQ20000  |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | P->S | QC (RAND)  |
|* 14 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q2,00 | PCWC |            |
|  15 |         PX BLOCK ITERATOR |           |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWC |            |
|  16 |          TABLE ACCESS FULL| BIG_TABLE |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."n1"="T2"."n1")
   4 - filter(ROWNUM<=100000000)
   7 - filter(ROWNUM<=100000000)
  11 - filter(ROWNUM<=100000000)
  14 - filter(ROWNUM<=100000000)

This is a rather synthetic query but there are a few things that I would like to point out. First, this query uses a parallel hint with 3 values representing table/degree/instances, however instances has been deprecated (see 10.2 parallel hint documentation). In this case the DOP is calculated by degree * instances or 16, not DOP=8 involving 2 instances. Note that the rownum filter is causing all the rows from the tables to be sent back to the QC for the COUNT STOPKEY operation thus causing the execution plan to serialize, denoted by the P->S in the IN-OUT column.

Riyaj had enabled sql trace for the QC and the TKProf output is such:

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=152 pr=701158 pw=701127 time=1510221226 us)
98976295   HASH JOIN  (cr=152 pr=701158 pw=701127 time=1244490336 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=200279054 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200279023 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=100270084 us)
      0       PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=300298770 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200298726 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=200279954 us)
      0       PX SEND QC (RANDOM) :TQ20000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)

Note that the Rows column contains zeros for many of the row sources because this trace is only for the QC, not the slaves, and thus only QC rows will show up in the trace file. Something to be aware of if you decide to use sql trace with parallel execution.

Off To The Lab: Myth Busting Or Bust!

I wanted to take a query like TC#1 and run it in my own environment so I could do more monitoring of it. Given the alleged myth had to do with interconnect traffic of cross-instance (inter-node) parallel execution, I wanted to be certain to gather the appropriate data. I ran several tests using a similar query on a similar sized data set (by row count) as the initial example. I ran all my experiments on a Oracle Real Application Clusters version 11.1.0.7 consisting of eight nodes, each with two quad-core CPUs. The interconnect is InfiniBand and the protocol used is RDS (Reliable Datagram Sockets).

Before I get into the experiments I think it is worth mentioning that Oracle's parallel execution (PX), which includes Parallel Query (PQ), PDML & PDDL, can consume vast amounts of resources. This is by design. You see, the idea of Oracle PX is to dedicate a large amount of resources (processes) to a problem by breaking it up into many smaller pieces and then operate on those pieces in parallel. Thus the more parallelism that is used to solve a problem, the more resources it will consume, assuming those resources are available. That should be fairly obvious, but I think it is worth stating.

For my experiments I used a table that contains just over 468M rows.

Below is my version of TC#1. The query is a self-join on a unique key and the table is range partitioned by DAY_KEY into 31 partitions. Note that I create a AWR snapshot immediately before and after the query.

exec dbms_workload_repository.create_snapshot

select /* &amp;&amp;1 */
       /*+ parallel (t1, 16) parallel (t2, 16) */
       min (t1.bsns_unit_key + t2.bsns_unit_key),
       max (t1.day_key + t2.day_key),
       avg (t1.day_key + t2.day_key),
       max (t1.bsns_unit_typ_cd),
       max (t2.curr_ind),
       max (t1.load_dt)
from   dwb_rtl_trx t1,
       dwb_rtl_trx t2
where  t1.trx_nbr = t2.trx_nbr;

exec dbms_workload_repository.create_snapshot

Experiment Results Using Fixed DOP=16

I ran my version of TC#1 across a varying number of nodes by using Oracle services (instance_groups and parallel_instance_group have been deprecated in 11g), but kept the DOP constant at 16 for all the tests. Below is a table of the experiment results.

Nodes
Elapsed Time
SQL Monitor Report
AWR Report
AWR SQL Report
1
00:04:54.12
2
00:03:55.35
4
00:02:59.24
8
00:02:14.39

Seemingly contrary to what many people would probably guess, the execution times got better the more nodes that participated in the query even though the DOP constant throughout each of tests.

Measuring The Interconnect Traffic

One of the new additions to the AWR report in 11g was the inclusion of interconnect traffic by client. This section is near the bottom of the report and looks like such:
Interconnect Throughput By Client
This allows the PQ message traffic to be tracked, whereas in prior releases it was not.

Even though AWR contains the throughput numbers (as in megabytes per second) I thought it would be interesting to see how much data was being transferred, so I used the following query directly against the AWR data. I put a filter predicate on to return only where there DIFF_RECEIVED_MB >= 10MB so the instances that were not part of the execution are filtered out, as well as the single instance execution.

break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 910 and 917
        order by snap_id,
                 instance_number)
where  snap_id in (911, 913, 915, 917) and
       diff_received_mb &gt;= 10
/

SNAP_ID    INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
       913               1        11604            10688
                         2        10690            11584
**********                 ------------ ----------------
sum                               22294            22272

       915               1         8353             8350
                         2         8133             8418
                         3         8396             8336
                         4         8514             8299
**********                 ------------ ----------------
sum                               33396            33403

       917               1         5033             4853
                         2         4758             4888
                         3         4956             4863
                         4         5029             4852
                         5         4892             4871
                         6         4745             4890
                         7         4753             4889
                         8         4821             4881
**********                 ------------ ----------------
sum                               38987            38987

As you can see from the data, the more nodes that were involved in the execution, the more interconnect traffic there was, however, the execution times were best with 8 nodes.

Further Explanation Of Riyaj's Issue

If you read Riyaj's post, you noticed that he observed worse, not better as I did, elapsed times when running on two nodes versus one. How could this be? It was noted in the comment thread of that post that the configuration was using Gig-E as the interconnect in a Solaris IPMP active-passive configuration. This means the interconnect speeds would be capped at 128MB/s (1000Mbps), the wire speed of Gig-E. This is by all means is an inadequate configuration to use cross-instance parallel execution.

There is a whitepaper entitled Oracle SQL Parallel Execution that discusses many of the aspects of Oracle's parallel execution. I would highly recommend reading it. This paper specifically mentions:

If you use a relatively weak interconnect, relative to the I/O bandwidth from the server to the storage configuration, then you may be better of restricting parallel execution to a single node or to a limited number of nodes; inter-node parallel execution will not scale with an undersized interconnect.

I would assert that this is precisely the root cause (insufficient interconnect bandwidth for cross-instance PX) behind the issues that Riyaj observed, thus making his execution slower on two nodes than one node.

The Advantage Of Hash Partitioning/Subpartitioning And Full Partition-Wise Joins

At the end of my comment on Riyaj's blog, I mentioned:

If a DW frequently uses large table to large table joins, then hash partitioning or subpartitioning would yield added gains as partition-wise joins will be used.

I thought that it would be both beneficial and educational to extend TC#1 and implement hash subpartitioning so that the impact could be measured on both query elapsed time and interconnect traffic. In order for a full partition-wise join to take place, the table must be partitioned/subpartitioned on the join key column, so in this case I've hash subpartitioned on TRX_NBR. See the Oracle Documentation on Partition-Wise Joins for a more detailed discussion on PWJ.

Off To The Lab: Partition-Wise Joins

I've run through the exact same test matrix with the new range/hash partitioning model and below are the results.

Nodes
Elapsed Time
SQL Monitor Report
AWR Report
AWR SQL Report
1
00:02:42.41
2
00:01:37.29
4
00:01:12.82
8
00:01:05.04

As you can see by the elapsed times, the range/hash partitioning model with the full partition-wise join has decreased the overall execution time by around a factor of 2X compared to the range only partitioned version.

Now let's take a look at the interconnect traffic for the PX messages:

break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 1041 and 1048
        order by snap_id,
                 instance_number)
where  snap_id in (1042,1044,1046,1048) and
       diff_received_mb &gt;= 10
/
no rows selected

Hmm. No rows selected?!? I had previously put in the predicate "DIFF_RECEIVED_MB >= 10MB" to filter out the nodes that were not participating in the parallel execution. Let me remove that predicate rerun the query.

   SNAP_ID INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
      1042               1            8                6
                         2            2                3
                         3            2                3
                         4            2                3
                         5            2                3
                         6            2                3
                         7            2                3
                         8            2                3
**********                 ------------ ----------------
sum                                  22               27

      1044               1            7                7
                         2            3                2
                         3            2                2
                         4            2                2
                         5            2                2
                         6            2                2
                         7            2                2
                         8            2                2
**********                 ------------ ----------------
sum                                  22               21

      1046               1            1                2
                         2            1                2
                         3            1                2
                         4            3                1
                         5            1                1
                         6            1                1
                         7            1                1
                         8            1                1
**********                 ------------ ----------------
sum                                  10               11

      1048               1            6                5
                         2            1                2
                         3            3                2
                         4            1                2
                         5            1                2
                         6            1                2
                         7            1                2
                         8            1                2
**********                 ------------ ----------------
sum                                  15               19

Wow, there is almost no interconnect traffic at all. Let me verify with the AWR report from the 8 node execution.

The AWR report confirms that there is next to no interconnect traffic for the PWJ version of TC#1. The reason for this is that since the table is hash subpartitoned on the join column each of the subpartitions can be joined to each other minimizing the data sent between parallel execution servers. If you look at the execution plan (see the AWR SQL Report) from the first set of experiments you will notice that the broadcast method for each of the tables is HASH but in the range/hash version of TC#1 there is no broadcast at all for either of the two tables. The full partition-wise join behaves logically the same way that a shared-nothing database would; each of the parallel execution servers works on its partition which does not require data from any other partition because of the hash partitioning on the join column. The main difference is that in a shared-nothing database the data is physically hash distributed amongst the nodes (each node contains a subset of all the data) where as all nodes in a Oracle RAC database have access to the all the data.

Parting Thoughts

Personally I see no myth about cross-instance (inter-node) parallel execution and interconnect traffic, but frequently I see misunderstandings and misconceptions. As shown by the data in my experiment, TC#1 (w/o hash subpartitioning) running on eight nodes is more than 2X faster than running on one node using exactly the same DOP. Interconnect traffic is not a bad thing as long as the interconnect is designed to support the workload. Sizing the interconnect is really no different than sizing any other component of your cluster (memory/CPU/disk space/storage bandwidth). If it is undersized, performance will suffer. Depending on the number and speed of the host CPUs and the speed and bandwidth of the interconnect, your results may vary.

By hash subpartioning the table the interconnect traffic was all but eliminated and the query execution times were around 2X faster than the non-hash subpartition version of TC#1. This is obviously a much more scalable solution and one of the main reasons to leverage hash (sub)partitioning in a data warehouse.

Oracle Press Release: Customers are Choosing the Oracle Database Machine

Oracle put out a press release today entitled “Customers are Choosing the Oracle Database Machine” mentioning the new Exadata and Oracle Database Machine customers. I’ve quoted a few parts of it below. Oracle cites twenty initial customers.

Initial Customers

Initial Oracle Exadata customers including Amtrak, Allegro Group, Automobile Association of the UK, CTC, Garanti Bank, Giant Eagle, HISCOM (Hokuriku Coca Cola), KnowledgeBase Marketing, Loyalty Partner Solutions, M-Tel, MTN Group, Nagase, NS Solutions, NTT Data, OK Systems, Research in Motion, SoftBank Mobile, Screwfix, ThomsonReuters, and True Telecom, confirm the benefits Oracle Exadata products bring to their Oracle data warehouses.

Supporting Quotes

  • “The HP Oracle Database Machine beat the competing solutions we tested on bandwidth, load rate, disk capacity, and transparency. In addition, Allegro Group saw a significant performance boost from the new data warehouse. A query that used to take 24 hours to complete now runs in less than 30 minutes on the HP Oracle Database Machine, and that’s without any manual query tuning.” — Christian Maar, CIO of Poznań, Poland-based Allegro Group
  • “After carefully testing various options for a new data warehouse platform we chose the HP Oracle Database Machine over Netezza. Oracle Exadata was able to speed up one of our critical processes from days to minutes. The HP Oracle Database Machine will allow us to improve service levels and expand our service offerings. We also plan to consolidate our current data warehouse solutions onto the Oracle Exadata platform. This should eliminate several servers and a number of storage arrays and help reduce our operating overhead and improve margins.” — Brian Camp, Sr. VP of Infrastructure Services, KnowledgeBase Marketing
  • “We anticipate the move of our Data Warehouse to Oracle Database 11g running on our first HP Oracle Database Machine with Oracle Exadata will deliver a substantial boost in performance and scalability, simply and easily. Our business users expect to benefit from faster access to information more quickly than ever before. The resulting agility should make a huge difference to our business.” — Andreas Berninger, Chief Operating, Loyalty Partner Solutions
  • “The biggest technological challenge we face when we architect a database is how to create a system that performs fast with huge volumes. Oracle Exadata helps solve our performance demands. It's highly available and reliable, and it can essentially scale linearly. All of the queries we tested were faster with Oracle Exadata. The smallest performance boost we experienced was 10 times; the fastest was 72 times faster.” — Plamen Zyumbyulev, Head of Database Administration, M-Tel
  • “A key component of RIM's manufacturing process is extensive testing of each handheld device. This testing generates large volumes of data, which is extensively analyzed by our quality and test engineers and business users to ensure RIM is producing the highest quality devices for our customers. The HP Oracle Database Machine is an ideal platform to store and analyze this data since it provides the performance, scalability and storage capacity for our requirements. It’s a cost-effective platform to meet our speed and scalability needs and is an integral component used for analysis in our manufacturing process.” — Ketan Parekh, Manager Database Systems, Research in Motion
  • “The benchmark result of Oracle Exadata was amazing! Since it is based on Oracle Database 11g, we determined that it is compatible with other systems and the most suitable solution for our increasing data infrastructure.” — Keiichiro Shimizu, General Manager, Business Base Management Dept., Information System Div., SoftBank Mobile Corp.
  • “Oracle Exadata is among the most successful new product introductions in Oracle's history,” said Willie Hardie, vice president of Database Product Marketing, Oracle. “Repeatedly in customer proof of concepts and benchmarks, Oracle Exadata has delivered extreme performance for customers' data warehouses.”

  • Exadata Snippits From Oracle F4Q09 Earnings Call

    Oracle Corporation had its F4Q09 earnings call today and the Exadata comments started right away with the earnings press release:

    “The Exadata Database Machine is well on its way to being the most successful new product launch in Oracle’s 30 year history,” said Oracle CEO Larry Ellison. “Several of Teradata’s largest customers are performance testing — then buying — Oracle Exadata Database Machines. In a recent competitive benchmark, a Teradata machine took over six hours to process a query that our Exadata Database Machine ran in less than 30 minutes. They bought Exadata.”

    During the earnings call Larry Ellison discusses Exadata and the competition:

    …I’m going to talk about Exadata again. I said last quarter that Exadata is shaping up to be our most exciting and successful new product introduction in Oracle’s 30 year history and [in the] last quarter Exadata continues to grow and win competitive deals in the marketplace against our three primarily competitors. It’s turning out that Teradata is our number one competitor…Netezza and IBM are kind of tied for second.

    Ellison describes some of the Exadata sales from this quarter which include:

    • A well-known California SmartPhone and computer manufacturer (win vs. Netezza) who commented that Exadata ran about 100 times faster in some cases then their standard Oracle environment
    • Research in Motion
    • Amtrak
    • A large East Coast insurance company
    • Thomson Reuters
    • A Japanese telco (biggest Teradata customer in Japan) who benchmarked Exadata and found it to be dramatically faster then Teradata
    • Barclays Capital (UK)
    • A number of banks in Western Europe and Germany

    Larry Ellison follows with:

    It was just a great quarter for Exadata, a product that is relatively new to the marketplace that is persuading people to move from their existing environments because Exadata is faster and the hardware costs less.

    In the Q&A Larry Ellison responds to John DiFucci on Exadata:

    By the way every customer I mentioned and alluded to were actual sales. Now some of these, because the Exadata product is so new, quite often will install in kind of a try and buy situation, but I can’t think of a case where we installed the machine that they didn’t buy. So we’re winning these benchmarks. Sometimes we’re beating Teradata. I think in my quote, I said we’ve beat Teradata on one of the queries by 20 to one. So we think it’s a brand new technology, we think we’re a lot faster then the competition. The benchmarks are proving out with real customer data, we’re proving to be much faster then the competition. Every single deal I mentioned were cases where the customer bought the system. There are obviously other evaluations going on and we expect the Exadata sales to accelerate.

    Facebook: Hive – A Petabyte Scale Data Warehouse Using Hadoop

    Today, June 10th, marks the Yahoo! Hadoop Summit ’09 and the crew at Facebook have a writeup on the Facebook Engineering page entitled: Hive – A Petabyte Scale Data Warehouse Using Hadoop.

    I found this an very interesting read given some of the Hadoop/MapReduce comments from David J. DeWitt and Michael Stonebraker as well as their SIGMOD 2009 paper, A Comparison of Approaches to Large-Scale Data Analysis. Now I’m not about to jump into this whole dbms-is-better-than-mapreduce argument but I found Facebook’s story line interesting:

    When we started at Facebook in 2007 all of the data processing infrastructure was built around a data warehouse built using a commercial RDBMS. The data that we were generating was growing very fast – as an example we grew from a 15TB data set in 2007 to a 2PB data set today. The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to process and the situation was just getting worse with every passing day. We had an urgent need for infrastructure that could scale along with our data and it was at that time we then started exploring Hadoop as a way to address our scaling needs.

    [The] Hive/Hadoop cluster at Facebook stores more than 2PB of uncompressed data and routinely loads 15 TB of data daily

    Wow, 2PB of uncompressed data and growing at around 15TB daily. A part of me wonders how much value there is in 2PB of data or if companies are suffering from OCD when it comes to data. Either way it’s interesting to see how much data is being generated/collected and how engineers are dealing with it.

    Oracle And HP Take Back #1 Spot For 1TB TPC-H Benchmark

    Oracle and HP have taken back the #1 spot by setting a new performance record in the 1TB TPC-H benchmark. The HP/Oracle result puts the Oracle database ahead of both the Exasol (currently #2 & #3) and ParAccel (currently #4) results in the race for performance at the 1TB scale factor and places Oracle in the >1 million queries per hour (QphH) club, which is no small achievement. Compared to the next best result from HP/Oracle (currently #5), this result has over 9X the query throughput (1,166,976 QphH vs. 123,323 QphH) at around 1/4 the cost (5.42 USD vs. 20.54 USD) demonstrating significantly more performance for the money.

    Some of the interesting bits from the hardware side:

    • 4 HP BladeSystem c7000 Enclosures
    • 64 HP ProLiant BL460c Servers
    • 128 Quad-Core Intel Xeon X5450 “Harpertown” Processors (512 cores)
    • 2TB Total System Memory (RAM)
    • 6 HP Oracle Exadata Storage Servers

    As you can see, this was a 64 node Oracle Real Application Cluster (RAC), each node having 2 processors (8 cores). This is also the first TPC-H benchmark from Oracle that used Exadata as the storage platform.

    Congratulation to the HP/Oracle team on the great accomplishment!

    Transaction Processing Performance Council_1244094205417.png

    Blogroll Additions: Alberto Dell'Era & Christian Antognini

    Fellow OakTable Network member, Alberto Dell’Era, has starting blogging and I wanted mention it to my readers as well as add it to my blogroll. I’ve followed Alberto’s works for some time and always was impressed by his thoroughness in investigation as well as explanation. Most recently I read his solution to the First International NoCOUG SQL Challenge and it immediately made me jealous as I’ve always liked math challenges, but my time has been dominated by HP Oracle Database Machine activities since its launch. Great solution Alberto! Often I wonder if programmers today even know what big O is?

    I’m also adding Christian Antognini, author of Troubleshooting Oracle Performance and fellow OakTable Network member, to my blogroll as well. If you have not read Christian’s book, I would recommend you do. I would classify it as a “need to read” book for Oracle professionals. It is a great resource for Oracle troubleshooting: no silver bullets included or required.

    Oracle Sun FAQ

    On Monday, April 20, 2009, Oracle announced that it had agreed to acquire Sun Microsystems. Since then there has been much speculation and question raised around numerous areas of the deal. There is an official FAQ that discusses many areas, but I thought I would highlight three that seem to be fairly popular around the blogosphere:

    Will the ownership of Solaris change Oracle’s position on Linux?
    No. This transaction enhances our commitment to open standards and choice. Oracle is as committed as ever to Linux and other platforms and will continue to support and enhance our strong industry partnerships.

    What does Oracle plan to do with MySQL?
    MySQL will be an addition to Oracle’s existing suite of database products, which already includes Oracle Database 11g, TimesTen, Berkeley DB open source database, and the open source transactional storage engine, InnoDB.

    What impact does this announcement have on the HP Oracle Exadata Storage Server and HP Oracle Database Machine products?
    There is no impact. Oracle remains fully committed to the HP Oracle Exadata Storage Server and HP Oracle Database Machine products.

    Kevin Closson's Silly Little Benchmark Is Silly Fast On Nehalem

    Recently Kevin Closson wrote about the absurdly simple NUMA requirements for TPC-C on the new Intel Nehalem platform and I also mentioned my excitement that databases run 2X faster on Nehalem 5500 series compared to the Intel 5400 series processors. I do have to give credit to Kevin for being excited about Nehalem (the processors, not the river, though that is a very nice fish he caught) way back in March of 2007 when he wrote:

    [Nehalem] are quad core processors that are going to pack a very significant punch—much more so than the AMD Barcelona processor expected later this year [in 2007]

    I also came across Kevin’s Silly Little Bechmark (SLB) and wanted to give it a run on my new Nehalem Mac Pro and see how it compares to the dual-core Opteron 800 series processor in the DL585 Kevin posted results for on his blog. Unfortunately I only have the entry level Mac Pro with the single 2.66GHz Quad-Core Intel Xeon Nehalem processor and 3 GB of memory. Although, if you have $12,000 burning a hole in your pocket you could snag a 2 x 2.93GHz Quad-Core version and stuff it with 32 GB of memory. I’m quite certain you’ll have to add a 5 point racing harness to your desk chair to operate it though. Don’t take my word for it though – see the Geekbench results. My Mac Pro Geekbench result blows away the PC system it is replacing. Anyway back to the SLB tests…

    SLB Test Setup (Memhammer)

    Here is the script (./slb) I ran to test out the Mac Pro:

    #!/bin/bash
    
    uname -a
    date
    
    echo "One memhammer crushing 1GB physical memory:"
    ./create_sem
    ./memhammer 262144 6000
    
    echo "Reduce the memory to facilitate 1-4 scale-up test"
    echo "1 thread .25GB:"
    ./create_sem
    ./memhammer 65536 6000
    
    echo "2 threads .25GB each:"
    ./create_sem
    ./memhammer 65536 6000 &
    ./memhammer 65536 6000 &
    ./trigger
    wait
    
    echo "4 threads .25GB each:"
    ./create_sem
    ./memhammer 65536 6000 &
    ./memhammer 65536 6000 &
    ./memhammer 65536 6000 &
    ./memhammer 65536 6000 &
    ./trigger
    wait
    
    

    SLB Results

    Let’s fire off memhammer and make that CPU sweat!

    # ./slb
    Darwin greg-rahns-mac-pro.local 9.6.3 Darwin Kernel Version 9.6.3: Tue Jan 20 18:26:40 PST 2009; root:xnu-1228.10.33~1/RELEASE_I386 i386
    Fri Apr 10 11:48:06 PDT 2009
    One memhammer crushing 1GB physical memory:
    Total ops 1572864000  Avg nsec/op    30.6  gettimeofday usec 48204156 TPUT ops/sec 32629219.8
    Reduce the memory to facilitate 1-4 scale-up test
    1 thread .25GB:
    Total ops 393216000  Avg nsec/op    30.7  gettimeofday usec 12059811 TPUT ops/sec 32605486.1
    2 threads .25GB each:
    Total ops 393216000  Avg nsec/op    32.0  gettimeofday usec 12593203 TPUT ops/sec 31224462.9
    Total ops 393216000  Avg nsec/op    32.0  gettimeofday usec 12593862 TPUT ops/sec 31222829.0
    4 threads .25GB each:
    Total ops 393216000  Avg nsec/op    36.6  gettimeofday usec 14391042 TPUT ops/sec 27323664.3
    Total ops 393216000  Avg nsec/op    36.6  gettimeofday usec 14394529 TPUT ops/sec 27317045.2
    Total ops 393216000  Avg nsec/op    36.7  gettimeofday usec 14423406 TPUT ops/sec 27262354.0
    Total ops 393216000  Avg nsec/op    36.7  gettimeofday usec 14449585 TPUT ops/sec 27212961.5
    
    

    Pretty good stuff! Right around 33 nanoseconds per operation. Blazing fast!

    2.66GHz Quad-Core Intel W3520 Nehalem Vs. AMD 2.20GHz Dual-Core Opteron 800 Series

    Below are my Mac Pro results compared to the DL585 AMD results Kevin posed.

    Command line: ./memhammer 262144 6000
    
    # Mac Pro @ 2.66GHz Quad-Core Intel Xeon Nehalem (model W3520)
    Total ops 1572864000 Avg nsec/op 30.7 gettimeofday usec  48292072 TPUT ops/sec 32569818.1
    
    # HP DL585 @ 2.200GHz Dual-Core Opteron 800 series
    Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2
    

    The Mac Pro’s Nehalam processor is just over 2X as fast to perform the operation (30.7 ns vs. 68.8 ns) and thus does just over 2X the operations per second (32569818.1 ops/s vs. 14525744.2 ops/s). This Nehalem Mac Pro is simply amazing and definitely a beast on the inside. Can you imagine the compute power Virginia Tech would have if they upgraded their 324 dual-processor quad-core 2.8GHz Penryn Mac Pros to the 2.93GHz Nehalem Mac Pros?

    Feel free to run SLB/Memhammer on your system and put the results in a comment.

    Intel Nehalem-EP Xeon 5500 Series Processors Makes Databases Go 2X Faster

    As a database performance engineer there are certain things that get me really excited. One of them is hardware. Not just any hardware, but the latest, greatest, bleeding edge stuff. It is especially exciting when the latest generation of CPUs are twice as fast as the previous generation, and those being no slouch. This is how Intel’s new Nehalem-EP Xeon 5500 series processors are shaping up.

    The big launch was on March 30th so in the past few days all the benchmark reports and blog posts have been rolling in. Here are a few that I think are worth highlighting:

    The SQL Server Performance Blog reports:

    Pat Gelsinger did a side-by-side performance demo which launched an SSRS report, running reporting queries against a 1.5 TB SSAS OLAP cube, built using a Microsoft adCenter data set. The demo showed how Nehalem-EP is 2X faster than a Xeon 5400 on the same workload, with the same DRAM and I/O configuration. Not too shabby, but we’ve seen even faster results (~3-4X faster) on workloads which are more memory bandwidth-intensive, like data warehousing or in-memory OLAP workloads.

    Intel’s Dave Hill over at the The Server Room Blog writes:

    As of March 30, 2009, Intel based 2 socket Xeon® 5500 series servers set at least 30 world performance records across a wide range of benchmarks that cover virtually every application type on the market. The performance results, just by themselves, are utterly amazing, and in general they are greater than 2x the Intel® Xeon® 5400 series processors (Harpertown).

    There are numerous other benchmarks listed over at the Intel® Xeon® Processor Performance summary page. Check them out. You should be nothing less than amazed. It surely is a great time to be using commodity hardware and if you are not, perhaps you should be! And for those database vendors who are using proprietary hardware like FPGAs, well, I guess you are wishing that Intel’s Nehalem-EP processors are an April Fools’ joke, but you would be wrong.

    Larry Ellison Mentions Exadata Performance Numbers

    Yesterday Oracle Corporation had its earnings call for F3Q2009. On the call Larry and Charles mention a few of the Exadata performance numbers observed.

    Larry Ellison:

    …looking forward, I think the most exciting product we’ve had in many, many years is our Exadata Database Server…

    Exadata is 100% innovation on top of our very large and very strong database business. And the early results have been remarkable. Charles Phillips will go into a lot of detail but I’ll just throw a couple of numbers out there.

    One of our customers, and Charles will describe this customer, one of our customers saw a 28x performance improvement over an existing Oracle database. Another customer saw a monthly aggregation drop from 4.5 hours just to 3 minutes.

    When compared to Teradata, a competitive database machine that’s been in the market for a very, very long time, another customer saw that we were 6x faster than their existing Teradata application, when using Exadata versus Teradata.

    Another customer saw a batch process fall from 8 hours to 30 minutes. Charles will go into more detail on all this, he will repeat those numbers, because I think they’re worth mentioning twice.

    Charles Phillips:

    On databases, Larry mentioned, we’re very excited about how the HP Oracle database machine is performing. The increases have just been stunning and so we are getting great feedback from our customers and the pipeline is the largest build I’ve ever seen in terms of a new product.

    And as he mentioned, the numbers are just stunning. The major European retailer who reduced the batch processing time from 8 hours to 30 minutes did not believe the process had completed. We had to convince him that’s actually how it’s done.

    And so, as Larry mentioned, this is the reminder that this is an internally developed technology in the midst of all the discussion of acquisitions. People forget that we’re actually spending $3.0 billion a year on research and development and this is why we do it.

    I agree with Larry 100%. Exadata is the most exciting product and from a database performance engineer perspective (and a hardware junkie) it is quite amazing to see a single 42U rack of HP Oracle Database Machine rip and tear through terabytes of data as if it were breaking the laws of physics. Quite exciting indeed.

    The Impact Of Good Table And Query Design

    There are many ways to design tables/schemas and many ways to write SQL queries that execute against those tables/schemas. Some designs are better than others for various reasons, however, I think that frequently people underestimate the power of SQL (for both “good” and “evil”). All too often in data warehouses, I see tables designed for one specific report, or a very select few reports. These tables frequently resemble Microsoft Excel Spreadsheets (generally Pivot Tables), not good Dimensional (Star Schema) or Third Normal Form (3NF) schema design. The problem with such designs is that it severely limits the usefulness of that data, as queries that were not known at the time of design often time become problematic. The following is a simple one table example, derived from a field experience in which I discuss two table designs and provide the SQL queries to answer a question the business is seeking.

    The Business Question

    First lets start with the business question for which the answer is being sought.
    What customers meet the following criteria:

    • do not own PRODUCT1 or PRODUCT2 but have downloaded SOFTWARE
    • do not own PRODUCT2 and it has been more than 90 days between SOFTWARE download and their purchase of PRODUCT1

    Version 1: The Column Based (Pivot) Table Design

    For Version 1, there is a single row for each customer and each attribute has its own column. In this case there are 4 columns, each representing the most recent activity date for that product.

    SQL> desc column_tab
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     CUSTOMER_ID                               NOT NULL NUMBER
     SOFTWARE_MAC_RECENCY_TS                            DATE
     SOFTWARE_WIN_RECENCY_TS                            DATE
     PRODUCT1_RECENCY_TS                                DATE
     PRODUCT2_RECENCY_TS                                DATE
    
    SQL> select * from column_tab;
    
    CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
    ----------- ---------- ---------- ---------- ----------
            100 2009-03-17            2008-11-17
            200 2009-03-17            2009-01-16
            300 2009-03-17            2008-10-08 2009-02-25
            400            2009-03-17 2008-11-07
            500 2009-03-17
    
    5 rows selected.
    
    SQL> select customer_id
      2  from   column_tab
      3  where  product2_recency_ts is null and
      4         (((software_win_recency_ts is not null or
      5            software_mac_recency_ts is not null) and
      6           product1_recency_ts is null) or
      7          ((software_win_recency_ts - product1_recency_ts) > 90 or
      8           (software_mac_recency_ts - product1_recency_ts) > 90));
    
    CUSTOMER_ID
    -----------
            100
            400
            500
    
    3 rows selected.
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 4293700422
    
    --------------------------------------------------------------------------------
    | Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |            |     2 |    42 |     3   (0)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| COLUMN_TAB |     2 |    42 |     3   (0)| 00:00:01 |
    --------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       1 - filter("PRODUCT2_RECENCY_TS" IS NULL AND ("PRODUCT1_RECENCY_TS"
                  IS NULL AND ("SOFTWARE_MAC_RECENCY_TS" IS NOT NULL OR
                  "SOFTWARE_WIN_RECENCY_TS" IS NOT NULL) OR
                  "SOFTWARE_MAC_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90 OR
                  "SOFTWARE_WIN_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90))
    

    As you can see, the query construct to answer the business question is straight forward and requires just one pass over the table.

    Version 2: The Row Based (Unpivot) Table, Take 1

    In Version 2, there is a single row (tuple) which tracks the customer, product and the recency date. Unlike Version 1, none of the columns can be NULL.

    SQL> desc row_tab
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     CUSTOMER_ID                               NOT NULL NUMBER
     RECENCY_TS                                NOT NULL DATE
     PRODUCT                                   NOT NULL VARCHAR2(32)
    
    SQL> select * from row_tab;
    
    CUSTOMER_ID RECENCY_TS PRODUCT
    ----------- ---------- --------------------------------
            100 2009-03-17 SOFTWARE_MAC
            200 2009-03-17 SOFTWARE_MAC
            300 2009-03-17 SOFTWARE_MAC
            500 2009-03-17 SOFTWARE_MAC
            400 2009-03-17 SOFTWARE_WIN
            100 2008-11-17 PRODUCT1
            200 2009-01-16 PRODUCT1
            300 2008-10-08 PRODUCT1
            400 2008-11-07 PRODUCT1
            300 2009-02-25 PRODUCT2
    
    10 rows selected.
    
    SQL> select a.customer_id
      2  from   row_tab a,
      3         (select customer_id,
      4                 product,
      5                 recency_ts
      6          from   row_tab
      7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
      8  where  a.customer_id not in (select customer_id
      9                               from   row_tab
     10                               where  product in ('PRODUCT1', 'PRODUCT2')) and
     11         a.customer_id = b.customer_id
     12  union
     13  select a.customer_id
     14  from   row_tab a,
     15         (select customer_id,
     16                 product,
     17                 recency_ts
     18          from   row_tab
     19          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
     20  where  a.customer_id not in (select customer_id
     21                               from   row_tab
     22                               where  product = 'PRODUCT2') and
     23         a.customer_id = b.customer_id and
     24         (a.product = 'PRODUCT1' and
     25          b.recency_ts - a.recency_ts > 90);
    
    CUSTOMER_ID
    -----------
            100
            400
            500
    
    3 rows selected.
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 3517586312
    
    ---------------------------------------------------------------------------------
    | Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT      |         |    11 |   368 |    22  (60)| 00:00:01 |
    |   1 |  SORT UNIQUE          |         |    11 |   368 |    22  (60)| 00:00:01 |
    |   2 |   UNION-ALL           |         |       |       |            |          |
    |*  3 |    HASH JOIN ANTI     |         |    10 |   310 |    10  (10)| 00:00:01 |
    |*  4 |     HASH JOIN         |         |    11 |   187 |     7  (15)| 00:00:01 |
    |*  5 |      TABLE ACCESS FULL| ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
    |   6 |      TABLE ACCESS FULL| ROW_TAB |    10 |    30 |     3   (0)| 00:00:01 |
    |*  7 |     TABLE ACCESS FULL | ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
    |*  8 |    HASH JOIN ANTI     |         |     1 |    58 |    10  (10)| 00:00:01 |
    |*  9 |     HASH JOIN         |         |     1 |    44 |     7  (15)| 00:00:01 |
    |* 10 |      TABLE ACCESS FULL| ROW_TAB |     4 |    88 |     3   (0)| 00:00:01 |
    |* 11 |      TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
    |* 12 |     TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
    ---------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       3 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       4 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       5 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
       7 - filter("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2')
       8 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       9 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
           filter("RECENCY_TS"-"A"."RECENCY_TS">90)
      10 - filter("A"."PRODUCT"='PRODUCT1')
      11 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
      12 - filter("PRODUCT"='PRODUCT2')
    

    Version 2, Take 2

    The way the query is written in Version 2, Take 1, it requires six accesses to the table. Partly this is because it uses a UNION. In this case the UNION can be removed and replaced with an OR branch.

    SQL> select a.customer_id
      2  from   row_tab a,
      3         (select customer_id,
      4                 product,
      5                 recency_ts
      6          from   row_tab
      7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
      8  where  a.customer_id = b.customer_id and
      9         ((a.customer_id not in (select customer_id
     10                               from   row_tab
     11                               where  product in ('PRODUCT1', 'PRODUCT2')))
     12         or
     13         ((a.customer_id not in (select customer_id
     14                               from   row_tab
     15                               where  product = 'PRODUCT2') and
     16         (a.product = 'PRODUCT1' and
     17          b.recency_ts - a.recency_ts > 90))))
     18  /
    
    CUSTOMER_ID
    -----------
            100
            400
            500
    
    3 rows selected.
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 3327813549
    
    -------------------------------------------------------------------------------
    | Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT    |         |     1 |    44 |     7  (15)| 00:00:01 |
    |*  1 |  FILTER             |         |       |       |            |          |
    |*  2 |   HASH JOIN         |         |    11 |   484 |     7  (15)| 00:00:01 |
    |*  3 |    TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
    |   4 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
    |*  5 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
    |*  6 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
    -------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       1 - filter( NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
                  "CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2'))
                  OR  NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
                  "PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B2) AND
                  "A"."PRODUCT"='PRODUCT1' AND "RECENCY_TS"-"A"."RECENCY_TS">90)
       2 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       3 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
       5 - filter("CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR
                  "PRODUCT"='PRODUCT2'))
       6 - filter("PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B1)
    

    This rewrite brings the table accesses down to four from six, so progress is being made, but I think we can do even better.

    Version 2, Take 3

    SQL is a very powerful language and there is usually more than one way to structure a query. Version 2, Take 1 uses a very literal translation of the business question and Take 2 just does a mild rewrite changing the UNION to an OR. In Version 2, Take 3, I am going to leverage some different, but very powerful functionality to yield the same results.

    SQL> -- COLUMN_TAB can be expressed using ROW_TAB with MAX + CASE WHEN + GROUP BY:
    SQL> select   customer_id,
      2           max (case
      3                   when product = 'SOFTWARE_MAC'
      4                      then recency_ts
      5                end) software_mac_recency_ts,
      6           max (case
      7                   when product = 'SOFTWARE_WIN'
      8                      then recency_ts
      9                end) software_win_recency_ts,
     10           max (case
     11                   when product = 'PRODUCT1'
     12                      then recency_ts
     13                end) product1_recency_ts,
     14           max (case
     15                   when product = 'PRODUCT2'
     16                      then recency_ts
     17                end) product2_recency_ts
     18  from     row_tab
     19  group by customer_id;
    
    CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
    ----------- ---------- ---------- ---------- ----------
            100 2009-03-17            2008-11-17
            200 2009-03-17            2009-01-16
            300 2009-03-17            2008-10-08 2009-02-25
            400            2009-03-17 2008-11-07
            500 2009-03-17
    
    5 rows selected.
    
    SQL> -- The original query can be expressed as follows:
    SQL> select customer_id
      2  from   (select   customer_id,
      3                   max (case
      4                           when product = 'SOFTWARE_MAC'
      5                              then recency_ts
      6                        end) software_mac_recency_ts,
      7                   max (case
      8                           when product = 'SOFTWARE_WIN'
      9                              then recency_ts
     10                        end) software_win_recency_ts,
     11                   max (case
     12                           when product = 'PRODUCT1'
     13                              then recency_ts
     14                        end) product1_recency_ts,
     15                   max (case
     16                           when product = 'PRODUCT2'
     17                              then recency_ts
     18                        end) product2_recency_ts
     19          from     row_tab
     20          group by customer_id)
     21  where  product2_recency_ts is null and
     22         (((software_win_recency_ts is not null or
     23            software_mac_recency_ts is not null) and
     24           product1_recency_ts is null) or
     25          ((software_win_recency_ts - product1_recency_ts) > 90 or
     26           (software_mac_recency_ts - product1_recency_ts) > 90)
     27         );
    
    CUSTOMER_ID
    -----------
            100
            400
            500
    
    3 rows selected.
    
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 825621652
    
    -------------------------------------------------------------------------------
    | Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT    |         |     1 |    22 |     4  (25)| 00:00:01 |
    |*  1 |  FILTER             |         |       |       |            |          |
    |   2 |   HASH GROUP BY     |         |     1 |    22 |     4  (25)| 00:00:01 |
    |   3 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
    -------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       1 - filter(MAX(CASE "PRODUCT" WHEN 'PRODUCT2' THEN "RECENCY_TS" END
                  ) IS NULL AND ((MAX(CASE "PRODUCT" WHEN 'SOFTWARE_WIN' THEN
                  "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE "PRODUCT" WHEN
                  'SOFTWARE_MAC' THEN "RECENCY_TS" END ) IS NOT NULL) AND MAX(CASE
                  "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END ) IS NULL OR MAX(CASE
                  "PRODUCT" WHEN 'SOFTWARE_WIN' THEN "RECENCY_TS" END )-MAX(CASE
                  "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90 OR MAX(CASE
                  "PRODUCT" WHEN 'SOFTWARE_MAC' THEN "RECENCY_TS" END )-MAX(CASE
                  "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90))
    

    Rewriting the query as a CASE WHEN with a GROUP BY not only cleaned up the SQL, it also resulted in a single pass over the table. Version 2, Take 3 reduces the table access from four to one!

    Version 2, Take 4: The PIVOT operator in 11g

    In 11g the PIVOT operator was introduced and can simplify the query even more.

    SQL> -- In 11g the PIVOT operator can be used, so COLUMN_TAB can be expressed as:
    SQL> select *
      2  from row_tab
      3  pivot (max(recency_ts) for product in
      4         ('SOFTWARE_MAC' as software_mac_recency_ts,
      5          'SOFTWARE_WIN' as software_win_recency_ts,
      6          'PRODUCT1' as product1_recency_ts,
      7          'PRODUCT2' as product2_recency_ts));
    
    CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
    ----------- ---------- ---------- ---------- ----------
            100 2009-03-17            2008-11-17
            200 2009-03-17            2009-01-16
            300 2009-03-17            2008-10-08 2009-02-25
            400            2009-03-17 2008-11-07
            500 2009-03-17
    
    5 rows selected.
    
    SQL> -- Using PIVOT the original query can be expressed as:
    SQL> select customer_id
      2  from   row_tab
      3  pivot  (max(recency_ts) for product in
      4         ('SOFTWARE_MAC' as software_mac_recency_ts,
      5          'SOFTWARE_WIN' as software_win_recency_ts,
      6          'PRODUCT1' as product1_recency_ts,
      7          'PRODUCT2' as product2_recency_ts))
      8  where  product2_recency_ts is null and
      9         (((software_win_recency_ts is not null or
     10            software_mac_recency_ts is not null) and
     11           product1_recency_ts is null) or
     12          ((software_win_recency_ts - product1_recency_ts) > 90 or
     13           (software_mac_recency_ts - product1_recency_ts) > 90)
     14         );
    
    CUSTOMER_ID
    -----------
            100
            400
            500
    
    3 rows selected.
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 3127820873
    
    --------------------------------------------------------------------------------
    | Id  | Operation            | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT     |         |     1 |    22 |     4  (25)| 00:00:01 |
    |*  1 |  FILTER              |         |       |       |            |          |
    |   2 |   HASH GROUP BY PIVOT|         |     1 |    22 |     4  (25)| 00:00:01 |
    |   3 |    TABLE ACCESS FULL | ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
    --------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       1 - filter(MAX(CASE  WHEN ("PRODUCT"='PRODUCT2') THEN "RECENCY_TS"
                  END ) IS NULL AND ((MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN
                  "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE  WHEN
                  ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END ) IS NOT NULL) AND
                  MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END ) IS NULL
                  OR MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN "RECENCY_TS" END
                  )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90 OR
                  MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END
                  )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90))
    

    The Big Picture

    One thing that I did not touch on is the flexibility of the ROW_TAB design when it comes to evolution. Any number of products can be added without making any modifications to the loading process. In order to do this with the COLUMN_TAB a new column must be added for each new product. The other major difference between the two table designs is that ROW_TAB is insert only while COLUMN_TAB must be updated if the customer exists. Generally one wants to avoid updated in a data warehouse as 1) old data is usually over written and 2) updates are more expensive than inserts.

    The other major thing I won’t discuss in detail is how to partition or index (if required) COLUMN_TAB. Think about this. With ROW_TAB it is very straight forward.

    Summary

    There are many ways to design tables and write queries. Some of them work well, some do not. Some appear impossible at first, only to appear more simple later. Literal translation of a business question into SQL is usually far from optimal. One needs to think about the question being asked, the shape of the data, and the options available to solve that problem as well as the trade offs of those solutions. Remember: table definitions do not have to look like Spreadsheets. Generally only the output of a query needs to.

    Don’t get stuck in SQL-92. It is the year 2009. You should be writing your SQL using the constructs that are provided. Often times very complex data transformations can be done with just SQL. Leverage this power.

    All experiments performed on 11.1.0.7