Category: Performance

The Core Performance Fundamentals Of Oracle Data Warehousing – Table Compression

[back to Introduction]
Editor’s note: This blog post does not cover Exadata Hybrid Columnar Compression.

The first thing that comes to most people’s mind when database table compression is mentioned is the savings it yields in terms of disk space. While reducing the footprint of data on disk is relevant, I would argue it is the lesser of the benefits for data warehouses. Disk capacity is very cheap and generally plentiful, however, disk bandwidth (scan speed) is proportional to the number of spindles, no mater what the disk capacity and thus is more expensive. Table compression reduces the footprint on the disk drives that a given data set occupies so the amount of physical data that must be read off the disk platters is reduced when compared to the uncompressed version. For example, if 4000 GB of raw data can compress to 1000 GB, it can be read off the same disk drives 4X as fast because it is reading and transferring 1/4 of the data off the spindles (relative to the uncompressed size). Likewise, table compression allows for the database buffer cache to contain more data without having to increase the memory allocation because more rows can be stored in a compressed block/page compared to an uncompressed block/page.

Row major table compression comes in two flavors with the Oracle database: BASIC and OLTP.  In 11.1 these were also known by the key phrases COMPRESS or COMPRESS FOR DIRECT_LOAD OPERATIONS and COMPRESS FOR ALL OPERATIONS.  The BASIC/DIRECT_LOAD compression has been part of the Oracle database since version 9 and ALL OPERATIONS/OLTP compression was introduced in 11.1 with the Advanced Compression option.

Oracle row major table compression works by storing the column values for a given block in a symbol table at the beginning of the block. The more repeated values per block, even across columns, the better the compression ratio. Sorting data can increase the compression ratio as ordering the data will generally allow more repeat values per block. Specific compression ratios and gains from sorting data are very data dependent but compression ratios are generally between 2x and 4x.

Compression does add some CPU overhead when direct path loading data, but there is no measurable performance overhead when reading data as the Oracle database can operate on compressed blocks directly without having to first uncompress the block. The additional CPU required when bulk loading data is generally well worth the down wind gains for data warehouses. This is because most data in a well designed data warehouse is write once, read many times. Insert only and infrequently modified tables are ideal candidates for BASIC compression.  If the tables have significant DML performed against them, then OLTP compression would be advised (or no compression).

Given that most Oracle data warehouses that I have seen are constrained by I/O bandwidth (see Balanced Hardware Configuration) it is highly recommended to leverage compression so the logical table scan rate can increase proportionally to the compression ratio.  This will result in faster table and partition scans on the same hardware.

Oracle Documentation References:

The Core Performance Fundamentals Of Oracle Data Warehousing – Balanced Hardware Configuration

[back to Introduction]

If you want to build a house that will stand the test of time, you need to build on a solid foundation. The same goes for architecting computer systems that run databases. If the underlying hardware is not sized appropriately it will likely lead to people blaming software. All too often I see data warehouse systems that are poorly architected for the given workload requirements. I frequently tell people, “you can’t squeeze blood from a turnip“, meaning if the hardware resources are not there for the software to use, how can you expect the software to scale?

Undersizing data warehouse systems has become an epidemic with open platforms – platforms that let you run on any brand and configuration of hardware. This problem has been magnified over time as the size of databases have grown significantly, and generally outpacing the experience of those managing them. This has caused the “big three” database vendors to come up with suggested or recommended hardware configurations for their database platforms:

 

Simply put, the reasoning behind those initiatives was to help customers architect systems that are well balanced and sized appropriately for the size of their data warehouse.

Balanced Hardware Configurations

The foundation for a well performing data warehouse (or any system for that matter) is the hardware that it runs on. There are three main hardware resources to consider when sizing your data warehouse hardware. Those are:

  1. Number of CPUs
  2. Number of storage devices (HDDs or SSDs)
  3. I/O bandwidth between CPUs and storage devices

 

NB: I’ve purposely left off memory (RAM) as most systems are pretty well sized at 2GB or 4GB per CPU core these days.

A balanced system has the following characteristics:

dd459146image013en-us.jpg

 

As you can see, each of the three components are sized proportionally to each other. This allows for the max system throughput capacity as no single resource will become the bottleneck before any other. This was one of the critical design decisions that went into the Oracle Database Machine.

Most DBAs and System Admins know what the disk capacity numbers are for their systems, but when it comes to I/O bandwidth or scan rates, most are unaware of what the system is capable of in theory, let alone in practice. Perhaps I/O bandwidth utilization should be included in the system metrics that are collected for your databases. You do collect system metrics, right?

There are several “exchanges” that data must flow through from storage devices to host CPUs, many of which could become bottlenecks. Those include:

  • Back-end Fibre Channel loops (the fibre between the drive shelves and the storage array server processor)
  • Front-end Fibre Channel ports
  • Storage array server processors (SP)
  • Host HBAs

One should understand the throughput capacity of each of these components to ensure that one (or more) of them do not restrict the flow of data to the CPUs prematurely.

 

Unbalanced Hardware Configurations

All too frequently systems are not architected as balanced systems and the system ends up being constrained in one of the following three scenarios:

dd459146image011en-us.jpg

 

From the production systems that I have seen, the main deficiency is in I/O bandwidth (both I/O Channel and HDD). I believe there are several reasons for this. First, too many companies capacity plan for their data warehouse based on the size the data occupies on disk alone. That is, they purchase the number of HDDs for the system based on the drive capacity, not on the I/O bandwidth requirement. Think of it like this: If you were to purchase 2 TB of mirrored disk capacity (4 TB total) would you rather purchase 28 x 146 GB drives or 14 x 300 GB drives (or even 4 x 1 TB drives)? You may ask: Well, what is the difference (other than price); in each case you have the same net capacity, correct? Indeed, both configurations do have the same capacity, but I/O bandwidth (how fast you can read data off the HDDs) is proportional to the number of HDDs, not the capacity. Thus it should be slightly obvious then that 28 HDDs can deliver 2X the disk I/O bandwidth that 14 HDDs can. This means that it will take 2X as long to read the same amount of data off of 14 HDDs as 28 HDDs.

Unfortunately what tends to happen is that the bean counter types will see only two things:

  1. The disk capacity (space) you want to purchase (or the capacity that is required)
  2. The price per MB/GB/TB

This is where someone worthy of the the title systems architect needs to stand up and explain the concept of I/O bandwidth and the impact it has on data warehouse performance (your systems architect does know this, correct?). This is generally a difficult discussion because I/O bandwidth is not a line item on a purchase order, it is a derived metric that requires both thought and engineering (which means someone had to do some thinking about the requirements for this system!).

 

Summary

When sizing the hardware for your data warehouse consider your workload and understand following (and calculate numbers for them!):

  1. What rate (in MB/GB per second) can the CPUs consume data?
  2. What rate can storage devices produce data (scan rate)?
  3. What rate can the data be delivered from the storage array(s) to the host HBAs?

 

If you are unable to answer these questions in theory then you need to sit down and do some calculations. Then you need to use some micro benchmarks (like Oracle ORION) and prove out those calculations. This will give you the “speed limit” and an metric by which you can measure your database workload against. All computer systems much obey the laws of physics! There is no way around that.

Additional Readings

Kevin Closson has several good blog posts on related topics including:

as well as numerous others.

 

Oracle Documentation References:

 

The Core Performance Fundamentals Of Oracle Data Warehousing – Introduction

At the 2009 Oracle OpenWorld Unconference back in October I lead a chalk and talk session entitled The Core Performance Fundamentals Of Oracle Data Warehousing. Since this was a chalk and talk I spared the audience any powerpoint slides but I had several people request that make it into a presentation so they could share it with others. After some thought, I decided that a series of blog posts would probably be a better way to share this information, especially since I tend to use slides as a speaking outline, not a condensed version of a white paper. This will be the first of a series of posts discussing what I consider to be the key features and technologies behind well performing Oracle data warehouses.

Introduction

As an Oracle database performance engineer who has done numerous customer data warehouse benchmarks and POCs over the past 5+ years, I’ve seen many data warehouse systems that have been plagued with problems on nearly every DBMS commonly used in data warehousing. Interestingly enough, many of these systems were facing many of the same problems. I’ve compiled a list of topics that I consider to be key features and/or technologies for Oracle data warehouses:

Core Performance Fundamental Topics

In the upcoming posts, I’ll deep dive into each one of these topics discussing why these areas are key for a well performing Oracle data warehouse. Stay tuned…

Top 10 Oracle 11gR2 New Features

In catching up on blog posts I see that Jonathan Lewis, Christian Antognini and Nuno Souto picked up on the deferred segment creation new feature in Oracle 11gR2. In keeping with the theme, I thought I’d put together the top 10 new features in Oracle Database 11g Release 2 (11.2) that I consider significant.

  1. Analytic Functions 2.0
  2. Recursive WITH Clause
  3. Preprocessing Data for ORACLE_LOADER Access Driver in External Tables
  4. In-Memory Parallel Execution
  5. Auto Degree of Parallelism (Auto DOP) and Parallel Statement Queuing
  6. Significant Performance Improvement of MV On-Commit Fast Refresh
  7. Database Resource Manager Instance Caging
  8. ASM Intelligent Data Placement
  9. Database File System (DBFS)
  10. Hybrid Columnar Compression

In future posts I’ll dive into some of these in more technical detail but for now I thought I’d throw my list out there to raise awareness of the things I am looking at as a database performance engineer.

Oracle OpenWorld 2009: The Real-World Performance Group

Even though Oracle OpenWorld 2009 is a few months away, I thought I would take a moment to mention that the Oracle Real-World Performance Group will again be hosting three sessions. Hopefully you are no stranger to our Oracle database performance sessions and this year we have what I think will be a very exciting and enlightening session: The Terabyte Hour with the Real-World Performance Group. If you are the slightest bit interested in seeing just how fast the Oracle Database Machine really is and how it can devour flat files in no time, rip through and bend data at amazing speeds, this is the session for you. All the operations will be done live for you to observe. No smoke. No mirrors. Pure Exadata performance revealed.

Session ID: S311237
Session Title: Real-World Database Performance Roundtable
Session Abstract: This session is a panel discussion including Oracle’s Real-World Performance Group and other invited performance experts. To make the hour productive, attendees need to write their questions on postcards and hand them to the panel at the beginning of the session. The questions should stick to the subject matter of real-world database performance. The panel members look forward to meeting you and answering your questions.
   
Session ID: S311239
Session Title: The Terabyte Hour with the Real-World Performance Group
Session Abstract: Last year at Oracle OpenWorld, Oracle launched the Oracle Database Machine, a complete package of software, servers, and storage with the power to tackle large-scale business intelligence problems immediately and scale linearly as your data warehouse grows. In this session, Oracle’s Real-World Performance Group demonstrates how to use an Oracle Database Machine to load, transform, and query a 1-terabyte database in less than an hour. The demonstration shows techniques for exploiting full database parallelism in a simple but optimal manner.
   
Session ID: S311238
Session Title: Current Trends in Real-World Database Performance
Session Abstract: The year 2009 has been an exciting one for Oracle’s Real-World Performance Group. The group has been challenged by bigger databases, new performance challenges, and now the Oracle Database Machine with Oracle Exadata Storage Server. This session focuses on some of the real-world performance ideas and solutions that have worked over the last year, including performance design philosophies, best practices, and a few tricks and tips.

Oracle Parallel Execution: Interconnect Myths And Misunderstandings

A number of weeks back I had come across a paper/presentation by Riyaj Shamsudeen entitled Battle of the Nodes: RAC Performance Myths (avaiable here). As I was looking through it I saw one example that struck me as very odd (Myth #3 – Interconnect Performance) and I contacted him about it. After further review Riyaj commented that he had made a mistake in his analysis and offered up a new example. I thought I’d take the time to discuss this as parallel execution seems to be one of those areas where many misconceptions and misunderstandings exist.

The Original Example

I thought I’d quickly discuss why I questioned the initial example. The original query Riyaj cited is this one:

select /*+ full(tl) parallel (tl,4) */
       avg (n1),
       max (n1),
       avg (n2),
       max (n2),
       max (v1)
from   t_large tl;

As you can see this is a very simple single table aggregation without a group by. The reason that I questioned the validity of this example in the context of interconnect performance is that the parallel execution servers (parallel query slaves) will each return exactly one row from the aggregation and then send that single row to the query coordinator (QC) which will then perform the final aggregation. Given that, it would seem impossible that this query could cause any interconnect issues at all.

Riyaj’s Test Case #1

Recognizing the original example was somehow flawed, Riyaj came up with a new example (I’ll reference as TC#1) which consisted of the following query:

select /*+ parallel (t1, 8,2) parallel (t2, 8, 2)  */
       min (t1.customer_trx_line_id + t2.customer_trx_line_id),
       max (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.quantity_ordered + t2.quantity_ordered),
       max (t1.attribute_category),
       max (t2.attribute1),
       max (t1.attribute2)
from   (select *
        from   big_table
        where  rownum <= 100000000) t1,
       (select *
        from   big_table
        where  rownum <= 100000000) t2
where  t1.customer_trx_line_id = t2.customer_trx_line_id;

The execution plan for this query is:

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |     1 |   249 |       |  2846K  (4)| 01:59:01 |        |      |            |
|   1 |  SORT AGGREGATE           |           |     1 |   249 |       |            |          |        |      |            |
|*  2 |   HASH JOIN               |           |   100M|    23G|   762M|  2846K  (4)| 01:59:01 |        |      |            |
|   3 |    VIEW                   |           |   100M|    10G|       |  1214K  (5)| 00:50:46 |        |      |            |
|*  4 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|   5 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|   6 |       PX SEND QC (RANDOM) | :TQ10000  |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | P->S | QC (RAND)  |
|*  7 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q1,00 | PCWC |            |
|   8 |         PX BLOCK ITERATOR |           |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| BIG_TABLE |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWP |            |
|  10 |    VIEW                   |           |   100M|    12G|       |  1214K  (5)| 00:50:46 |        |      |            |
|* 11 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|  12 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|  13 |       PX SEND QC (RANDOM) | :TQ20000  |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | P->S | QC (RAND)  |
|* 14 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q2,00 | PCWC |            |
|  15 |         PX BLOCK ITERATOR |           |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWC |            |
|  16 |          TABLE ACCESS FULL| BIG_TABLE |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."n1"="T2"."n1")
   4 - filter(ROWNUM<=100000000)
   7 - filter(ROWNUM<=100000000)
  11 - filter(ROWNUM<=100000000)
  14 - filter(ROWNUM<=100000000)

This is a rather synthetic query but there are a few things that I would like to point out. First, this query uses a parallel hint with 3 values representing table/degree/instances, however instances has been deprecated (see 10.2 parallel hint documentation). In this case the DOP is calculated by degree * instances or 16, not DOP=8 involving 2 instances. Note that the rownum filter is causing all the rows from the tables to be sent back to the QC for the COUNT STOPKEY operation thus causing the execution plan to serialize, denoted by the P->S in the IN-OUT column.

Riyaj had enabled sql trace for the QC and the TKProf output is such:

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=152 pr=701158 pw=701127 time=1510221226 us)
98976295   HASH JOIN  (cr=152 pr=701158 pw=701127 time=1244490336 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=200279054 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200279023 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=100270084 us)
      0       PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=300298770 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200298726 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=200279954 us)
      0       PX SEND QC (RANDOM) :TQ20000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)

Note that the Rows column contains zeros for many of the row sources because this trace is only for the QC, not the slaves, and thus only QC rows will show up in the trace file. Something to be aware of if you decide to use sql trace with parallel execution.

Off To The Lab: Myth Busting Or Bust!

I wanted to take a query like TC#1 and run it in my own environment so I could do more monitoring of it. Given the alleged myth had to do with interconnect traffic of cross-instance (inter-node) parallel execution, I wanted to be certain to gather the appropriate data. I ran several tests using a similar query on a similar sized data set (by row count) as the initial example. I ran all my experiments on a Oracle Real Application Clusters version 11.1.0.7 consisting of eight nodes, each with two quad-core CPUs. The interconnect is InfiniBand and the protocol used is RDS (Reliable Datagram Sockets).

Before I get into the experiments I think it is worth mentioning that Oracle's parallel execution (PX), which includes Parallel Query (PQ), PDML & PDDL, can consume vast amounts of resources. This is by design. You see, the idea of Oracle PX is to dedicate a large amount of resources (processes) to a problem by breaking it up into many smaller pieces and then operate on those pieces in parallel. Thus the more parallelism that is used to solve a problem, the more resources it will consume, assuming those resources are available. That should be fairly obvious, but I think it is worth stating.

For my experiments I used a table that contains just over 468M rows.

Below is my version of TC#1. The query is a self-join on a unique key and the table is range partitioned by DAY_KEY into 31 partitions. Note that I create a AWR snapshot immediately before and after the query.

exec dbms_workload_repository.create_snapshot

select /* &amp;&amp;1 */
       /*+ parallel (t1, 16) parallel (t2, 16) */
       min (t1.bsns_unit_key + t2.bsns_unit_key),
       max (t1.day_key + t2.day_key),
       avg (t1.day_key + t2.day_key),
       max (t1.bsns_unit_typ_cd),
       max (t2.curr_ind),
       max (t1.load_dt)
from   dwb_rtl_trx t1,
       dwb_rtl_trx t2
where  t1.trx_nbr = t2.trx_nbr;

exec dbms_workload_repository.create_snapshot

Experiment Results Using Fixed DOP=16

I ran my version of TC#1 across a varying number of nodes by using Oracle services (instance_groups and parallel_instance_group have been deprecated in 11g), but kept the DOP constant at 16 for all the tests. Below is a table of the experiment results.

Nodes
Elapsed Time
SQL Monitor Report
AWR Report
AWR SQL Report
1
00:04:54.12
2
00:03:55.35
4
00:02:59.24
8
00:02:14.39

Seemingly contrary to what many people would probably guess, the execution times got better the more nodes that participated in the query even though the DOP constant throughout each of tests.

Measuring The Interconnect Traffic

One of the new additions to the AWR report in 11g was the inclusion of interconnect traffic by client. This section is near the bottom of the report and looks like such:
Interconnect Throughput By Client
This allows the PQ message traffic to be tracked, whereas in prior releases it was not.

Even though AWR contains the throughput numbers (as in megabytes per second) I thought it would be interesting to see how much data was being transferred, so I used the following query directly against the AWR data. I put a filter predicate on to return only where there DIFF_RECEIVED_MB >= 10MB so the instances that were not part of the execution are filtered out, as well as the single instance execution.

break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 910 and 917
        order by snap_id,
                 instance_number)
where  snap_id in (911, 913, 915, 917) and
       diff_received_mb &gt;= 10
/

SNAP_ID    INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
       913               1        11604            10688
                         2        10690            11584
**********                 ------------ ----------------
sum                               22294            22272

       915               1         8353             8350
                         2         8133             8418
                         3         8396             8336
                         4         8514             8299
**********                 ------------ ----------------
sum                               33396            33403

       917               1         5033             4853
                         2         4758             4888
                         3         4956             4863
                         4         5029             4852
                         5         4892             4871
                         6         4745             4890
                         7         4753             4889
                         8         4821             4881
**********                 ------------ ----------------
sum                               38987            38987

As you can see from the data, the more nodes that were involved in the execution, the more interconnect traffic there was, however, the execution times were best with 8 nodes.

Further Explanation Of Riyaj's Issue

If you read Riyaj's post, you noticed that he observed worse, not better as I did, elapsed times when running on two nodes versus one. How could this be? It was noted in the comment thread of that post that the configuration was using Gig-E as the interconnect in a Solaris IPMP active-passive configuration. This means the interconnect speeds would be capped at 128MB/s (1000Mbps), the wire speed of Gig-E. This is by all means is an inadequate configuration to use cross-instance parallel execution.

There is a whitepaper entitled Oracle SQL Parallel Execution that discusses many of the aspects of Oracle's parallel execution. I would highly recommend reading it. This paper specifically mentions:

If you use a relatively weak interconnect, relative to the I/O bandwidth from the server to the storage configuration, then you may be better of restricting parallel execution to a single node or to a limited number of nodes; inter-node parallel execution will not scale with an undersized interconnect.

I would assert that this is precisely the root cause (insufficient interconnect bandwidth for cross-instance PX) behind the issues that Riyaj observed, thus making his execution slower on two nodes than one node.

The Advantage Of Hash Partitioning/Subpartitioning And Full Partition-Wise Joins

At the end of my comment on Riyaj's blog, I mentioned:

If a DW frequently uses large table to large table joins, then hash partitioning or subpartitioning would yield added gains as partition-wise joins will be used.

I thought that it would be both beneficial and educational to extend TC#1 and implement hash subpartitioning so that the impact could be measured on both query elapsed time and interconnect traffic. In order for a full partition-wise join to take place, the table must be partitioned/subpartitioned on the join key column, so in this case I've hash subpartitioned on TRX_NBR. See the Oracle Documentation on Partition-Wise Joins for a more detailed discussion on PWJ.

Off To The Lab: Partition-Wise Joins

I've run through the exact same test matrix with the new range/hash partitioning model and below are the results.

Nodes
Elapsed Time
SQL Monitor Report
AWR Report
AWR SQL Report
1
00:02:42.41
2
00:01:37.29
4
00:01:12.82
8
00:01:05.04

As you can see by the elapsed times, the range/hash partitioning model with the full partition-wise join has decreased the overall execution time by around a factor of 2X compared to the range only partitioned version.

Now let's take a look at the interconnect traffic for the PX messages:

break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 1041 and 1048
        order by snap_id,
                 instance_number)
where  snap_id in (1042,1044,1046,1048) and
       diff_received_mb &gt;= 10
/
no rows selected

Hmm. No rows selected?!? I had previously put in the predicate "DIFF_RECEIVED_MB >= 10MB" to filter out the nodes that were not participating in the parallel execution. Let me remove that predicate rerun the query.

   SNAP_ID INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
      1042               1            8                6
                         2            2                3
                         3            2                3
                         4            2                3
                         5            2                3
                         6            2                3
                         7            2                3
                         8            2                3
**********                 ------------ ----------------
sum                                  22               27

      1044               1            7                7
                         2            3                2
                         3            2                2
                         4            2                2
                         5            2                2
                         6            2                2
                         7            2                2
                         8            2                2
**********                 ------------ ----------------
sum                                  22               21

      1046               1            1                2
                         2            1                2
                         3            1                2
                         4            3                1
                         5            1                1
                         6            1                1
                         7            1                1
                         8            1                1
**********                 ------------ ----------------
sum                                  10               11

      1048               1            6                5
                         2            1                2
                         3            3                2
                         4            1                2
                         5            1                2
                         6            1                2
                         7            1                2
                         8            1                2
**********                 ------------ ----------------
sum                                  15               19

Wow, there is almost no interconnect traffic at all. Let me verify with the AWR report from the 8 node execution.

The AWR report confirms that there is next to no interconnect traffic for the PWJ version of TC#1. The reason for this is that since the table is hash subpartitoned on the join column each of the subpartitions can be joined to each other minimizing the data sent between parallel execution servers. If you look at the execution plan (see the AWR SQL Report) from the first set of experiments you will notice that the broadcast method for each of the tables is HASH but in the range/hash version of TC#1 there is no broadcast at all for either of the two tables. The full partition-wise join behaves logically the same way that a shared-nothing database would; each of the parallel execution servers works on its partition which does not require data from any other partition because of the hash partitioning on the join column. The main difference is that in a shared-nothing database the data is physically hash distributed amongst the nodes (each node contains a subset of all the data) where as all nodes in a Oracle RAC database have access to the all the data.

Parting Thoughts

Personally I see no myth about cross-instance (inter-node) parallel execution and interconnect traffic, but frequently I see misunderstandings and misconceptions. As shown by the data in my experiment, TC#1 (w/o hash subpartitioning) running on eight nodes is more than 2X faster than running on one node using exactly the same DOP. Interconnect traffic is not a bad thing as long as the interconnect is designed to support the workload. Sizing the interconnect is really no different than sizing any other component of your cluster (memory/CPU/disk space/storage bandwidth). If it is undersized, performance will suffer. Depending on the number and speed of the host CPUs and the speed and bandwidth of the interconnect, your results may vary.

By hash subpartioning the table the interconnect traffic was all but eliminated and the query execution times were around 2X faster than the non-hash subpartition version of TC#1. This is obviously a much more scalable solution and one of the main reasons to leverage hash (sub)partitioning in a data warehouse.

Oracle And HP Take Back #1 Spot For 1TB TPC-H Benchmark

Oracle and HP have taken back the #1 spot by setting a new performance record in the 1TB TPC-H benchmark. The HP/Oracle result puts the Oracle database ahead of both the Exasol (currently #2 & #3) and ParAccel (currently #4) results in the race for performance at the 1TB scale factor and places Oracle in the >1 million queries per hour (QphH) club, which is no small achievement. Compared to the next best result from HP/Oracle (currently #5), this result has over 9X the query throughput (1,166,976 QphH vs. 123,323 QphH) at around 1/4 the cost (5.42 USD vs. 20.54 USD) demonstrating significantly more performance for the money.

Some of the interesting bits from the hardware side:

  • 4 HP BladeSystem c7000 Enclosures
  • 64 HP ProLiant BL460c Servers
  • 128 Quad-Core Intel Xeon X5450 “Harpertown” Processors (512 cores)
  • 2TB Total System Memory (RAM)
  • 6 HP Oracle Exadata Storage Servers

As you can see, this was a 64 node Oracle Real Application Cluster (RAC), each node having 2 processors (8 cores). This is also the first TPC-H benchmark from Oracle that used Exadata as the storage platform.

Congratulation to the HP/Oracle team on the great accomplishment!

Transaction Processing Performance Council_1244094205417.png

Kevin Closson's Silly Little Benchmark Is Silly Fast On Nehalem

Recently Kevin Closson wrote about the absurdly simple NUMA requirements for TPC-C on the new Intel Nehalem platform and I also mentioned my excitement that databases run 2X faster on Nehalem 5500 series compared to the Intel 5400 series processors. I do have to give credit to Kevin for being excited about Nehalem (the processors, not the river, though that is a very nice fish he caught) way back in March of 2007 when he wrote:

[Nehalem] are quad core processors that are going to pack a very significant punch—much more so than the AMD Barcelona processor expected later this year [in 2007]

I also came across Kevin’s Silly Little Bechmark (SLB) and wanted to give it a run on my new Nehalem Mac Pro and see how it compares to the dual-core Opteron 800 series processor in the DL585 Kevin posted results for on his blog. Unfortunately I only have the entry level Mac Pro with the single 2.66GHz Quad-Core Intel Xeon Nehalem processor and 3 GB of memory. Although, if you have $12,000 burning a hole in your pocket you could snag a 2 x 2.93GHz Quad-Core version and stuff it with 32 GB of memory. I’m quite certain you’ll have to add a 5 point racing harness to your desk chair to operate it though. Don’t take my word for it though – see the Geekbench results. My Mac Pro Geekbench result blows away the PC system it is replacing. Anyway back to the SLB tests…

SLB Test Setup (Memhammer)

Here is the script (./slb) I ran to test out the Mac Pro:

#!/bin/bash

uname -a
date

echo "One memhammer crushing 1GB physical memory:"
./create_sem
./memhammer 262144 6000

echo "Reduce the memory to facilitate 1-4 scale-up test"
echo "1 thread .25GB:"
./create_sem
./memhammer 65536 6000

echo "2 threads .25GB each:"
./create_sem
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./trigger
wait

echo "4 threads .25GB each:"
./create_sem
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./memhammer 65536 6000 &
./trigger
wait

SLB Results

Let’s fire off memhammer and make that CPU sweat!

# ./slb
Darwin greg-rahns-mac-pro.local 9.6.3 Darwin Kernel Version 9.6.3: Tue Jan 20 18:26:40 PST 2009; root:xnu-1228.10.33~1/RELEASE_I386 i386
Fri Apr 10 11:48:06 PDT 2009
One memhammer crushing 1GB physical memory:
Total ops 1572864000  Avg nsec/op    30.6  gettimeofday usec 48204156 TPUT ops/sec 32629219.8
Reduce the memory to facilitate 1-4 scale-up test
1 thread .25GB:
Total ops 393216000  Avg nsec/op    30.7  gettimeofday usec 12059811 TPUT ops/sec 32605486.1
2 threads .25GB each:
Total ops 393216000  Avg nsec/op    32.0  gettimeofday usec 12593203 TPUT ops/sec 31224462.9
Total ops 393216000  Avg nsec/op    32.0  gettimeofday usec 12593862 TPUT ops/sec 31222829.0
4 threads .25GB each:
Total ops 393216000  Avg nsec/op    36.6  gettimeofday usec 14391042 TPUT ops/sec 27323664.3
Total ops 393216000  Avg nsec/op    36.6  gettimeofday usec 14394529 TPUT ops/sec 27317045.2
Total ops 393216000  Avg nsec/op    36.7  gettimeofday usec 14423406 TPUT ops/sec 27262354.0
Total ops 393216000  Avg nsec/op    36.7  gettimeofday usec 14449585 TPUT ops/sec 27212961.5

Pretty good stuff! Right around 33 nanoseconds per operation. Blazing fast!

2.66GHz Quad-Core Intel W3520 Nehalem Vs. AMD 2.20GHz Dual-Core Opteron 800 Series

Below are my Mac Pro results compared to the DL585 AMD results Kevin posed.

Command line: ./memhammer 262144 6000

# Mac Pro @ 2.66GHz Quad-Core Intel Xeon Nehalem (model W3520)
Total ops 1572864000 Avg nsec/op 30.7 gettimeofday usec  48292072 TPUT ops/sec 32569818.1

# HP DL585 @ 2.200GHz Dual-Core Opteron 800 series
Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2

The Mac Pro’s Nehalam processor is just over 2X as fast to perform the operation (30.7 ns vs. 68.8 ns) and thus does just over 2X the operations per second (32569818.1 ops/s vs. 14525744.2 ops/s). This Nehalem Mac Pro is simply amazing and definitely a beast on the inside. Can you imagine the compute power Virginia Tech would have if they upgraded their 324 dual-processor quad-core 2.8GHz Penryn Mac Pros to the 2.93GHz Nehalem Mac Pros?

Feel free to run SLB/Memhammer on your system and put the results in a comment.

Intel Nehalem-EP Xeon 5500 Series Processors Makes Databases Go 2X Faster

As a database performance engineer there are certain things that get me really excited. One of them is hardware. Not just any hardware, but the latest, greatest, bleeding edge stuff. It is especially exciting when the latest generation of CPUs are twice as fast as the previous generation, and those being no slouch. This is how Intel’s new Nehalem-EP Xeon 5500 series processors are shaping up.

The big launch was on March 30th so in the past few days all the benchmark reports and blog posts have been rolling in. Here are a few that I think are worth highlighting:

The SQL Server Performance Blog reports:

Pat Gelsinger did a side-by-side performance demo which launched an SSRS report, running reporting queries against a 1.5 TB SSAS OLAP cube, built using a Microsoft adCenter data set. The demo showed how Nehalem-EP is 2X faster than a Xeon 5400 on the same workload, with the same DRAM and I/O configuration. Not too shabby, but we’ve seen even faster results (~3-4X faster) on workloads which are more memory bandwidth-intensive, like data warehousing or in-memory OLAP workloads.

Intel’s Dave Hill over at the The Server Room Blog writes:

As of March 30, 2009, Intel based 2 socket Xeon® 5500 series servers set at least 30 world performance records across a wide range of benchmarks that cover virtually every application type on the market. The performance results, just by themselves, are utterly amazing, and in general they are greater than 2x the Intel® Xeon® 5400 series processors (Harpertown).

There are numerous other benchmarks listed over at the Intel® Xeon® Processor Performance summary page. Check them out. You should be nothing less than amazed. It surely is a great time to be using commodity hardware and if you are not, perhaps you should be! And for those database vendors who are using proprietary hardware like FPGAs, well, I guess you are wishing that Intel’s Nehalem-EP processors are an April Fools’ joke, but you would be wrong.

The Best Benchmarketing I've Seen Yet: Measure BI Queries In Milliseconds

After posting about how ridiculous some of the benchmarketing claims that database vendors are making, Dave Menninger, VP of Marketing & Product Management at Vertica posted a comment that one of their customers reported a 40,400x gain in one query (this of course is after I openly joked about the 16,200x Vertica claim). So I made my way over to check out this claim, and sure enough, someone reported this. Here is the table presented in the webcast:

hMetrix_Vertica.png

To this database performance engineer, this yet another unimpressive performance claim, but rather a very creative use of numbers, or maybe better put, a good case of bad math. Or better yet, big fun with small numbers. Honestly, measuring a BI query response time in milliseconds?!?! I don’t even know if OLTP database users measure their query response time in milliseconds. I simply can’t stop laughing at the fact that there needs to be precision below 1 second. Obviously BI users could not possibly tell that their query ran in less than 1 second because the network latency would mask this. Not only that, it seems there were 154 queries to choose from and the Vertica marketing crew chose to mention this one. Brilliant I say. So yes Dave, this is even more ludicrous than the 16,200x claim. At best it is a 202x gain. You won’t get credit from me (and probably others) for fractional seconds, but thanks for mentioning it. It was a good chuckle. By the way, why add two extra places of precision for this query and not all the others?

I think it is also worth mentioning that the data set size for this case is 84GB (raw) and 10.5GB in the Vertica DB (8x compression). Given the server running the database has 32GB of RAM it easily classifies as an in-memory database, so response time should certainly be in the seconds. I don’t know about you, but performance claims on a database in which the uncompressed data fits on an Apple iPod don’t excite me.

Dave Menninger also mentions:

One other piece of information in an effort of full (or at least more) disclosure is the following blog post that breaks down the orders of magnitude differences between row stores and column stores to their constituent parts.
Debunking Yet Another Myth: Column-Stores As A Storage-Layer Only Optimization

Column stores have been a topic of many research papers. The one that has caught my attention most recently is the paper by Allison Holloway and David DeWitt (Go Badgers!) entitled Read-Optimized Databases, In Depth and the VLDB 2008 presentation which has an alternate title of Yet Another Row Store vs Column Store Paper. I might suggest that you give them a read. Perhaps the crew at The Database Column will offer some comments on Allison and David’s research. I’m surprised that they haven’t already.

Well, that’s enough fun for a Friday. Time to kick off some benchmark queries on my HP Oracle Database Machine.

Database Customer Benchmarketing Reports

A few weeks ago I read Curt Monash’s report on interpreting the results of data warehouse proofs-of-concept (POCs) and I have to say, I’m quite surprised that this topic hasn’t been covered more by analysts in the data warehousing space. I understand that analysts are not database performance engineers, but where do they think that the performance claims of 10x to 100x or more come from? Do they actually investigate these claims or just report on them? I can not say that I have ever seen any database analyst offer any technical insight into these boasts of performance. If some exist be sure to leave a comment and point me to them.

Oracle Exadata Performance Architect Kevin Closson has blogged about a 485x performance increase of Oracle Exadata vs. Oracle Exadata and his follow-up post to explain exactly where the 485x performance gain comes from gave me the nudge to finish this post that had been sitting in my drafts folder since I first read Curt’s post.

Customer Bechmarketing Claims

I thought I would compile a list of what the marketing folks at other database vendors are saying about the performance of their products. Each of these statements have been taken from the given vendor’s website.

  • Netezza: 10-100 times faster than traditional solutions…but it is not uncommon to see performance differences as large as 200x to even 400x or more when compared to existing Oracle systems
  • Greenplum: often 10 to 100 times faster than traditional solutions
  • DATAllegro: 10-100x performance over traditional platforms
  • Vertica: Performs 30x-200x faster than other solutions
  • ParAccel: 20X – 200X performance gains
  • EXASolution: can perform up to 100 times faster than with traditional databases
  • Kognitio WX2: Tests have shown to out-perform other database / data warehouse solutions by 10-60 times

Certainly seems these vendors are a positioning themselves against traditional database solutions, whatever that means. And differences as large as 400x against Oracle? What is it exactly they are comparing?

Investigative Research On Netezza’s Performance Claims

Using my favorite Internet search engine I came across this presentation by Netezza dated October 2007. On slide 21 Netezza is comparing an NPS 8150 (112 SPU, up to 4.5 TB of user data) server to IBM DB2 UDB on a p680 with 12 CPUs (the existing solution). Not being extremely familiar with the IBM hardware mentioned, I thought I’d research to see exactly what an IBM p680 server consists of. The first link in my search results took me to here where the web page states:

The IBM eServer pSeries 680 has been withdrawn from the market, effective March 28, 2003.

Searching a bit more I came across this page which states that the 12 CPUs in the pSeries 680 are RS64 IV microprocessors. According to Wikipedia the “RS64-IV or Sstar was introduced in 2000 at 600 MHz, later increased to 750 MHz”. Given that at best, the p680 had 12 CPUs running at 750 MHz and the NPS 8150 had 112 440GX PowerPC processors I would give the compute advantage to Netezza by a significant margin. I guess it is cool to brag how your most current hardware beat up on some old used and abused server who has already been served its end-of-life notice. I found it especially intriguing that Netezza is boasting about beating out an IBM p680 server that has been end-of-lifed more than four years prior to the presentation’s date. Perhaps they don’t have any more recent bragging to do?

Going back one slide to #20 you will notice a comparison of Netezza and Oracle. Netezza clearly states they used a NPS 8250 (224 SPUs, up to 9 TB of user data) against Oracle 10g RAC running on Sun/EMC. Well ok…Sun/EMC what??? Obviously there were at least 2 Sun servers, since Oracle 10g RAC is involved, but they don’t mention the server models at all, nor the storage, nor the storage connectivity to the hosts. Was this two or more Sun Netra X1s or what??? Netezza boasts a 449x improvement in a “direct comparison on one day’s worth of data”. What exactly is being compared is up to the imagination. I guess this could be one query or many queries, but the marketeers intentionally fail to mention. They don’t even mention the data set size being compared. Given that Netezza can read data off the 224 drives at 60-70 MB/s, the NPS 8250 has a total scan rate of over 13 GB/s. I can tell you first hand that there are very few Sun/EMC solutions that are configured to support 13 GB/s of I/O bandwidth. Most configurations of that vintage probably don’t support 1/10th of that I/O bandwidth (1.3 GB/s).

Here are a few more comparisons that I have seen in Netezza presentations:

  • NPS 8100 (112 SPUs/4.5 TB max) vs. SAS on Sun E5500/6 CPUs/6GB RAM
  • NPS 8100 (112 SPUs/4.5 TB max) vs. Oracle 8i on Sun E6500/12 CPUs/8 GB RAM
  • NPS 8400 (448 SPUs/18 TB max) vs. Oracle on Sun (exact hardware not mentioned)
  • NPS 8100 (112 SPUs/4.5 TB max) vs. IBM SP2 (database not mentioned)
  • NPS 8150z (112 SPUs/5.5 TB max) vs. Oracle 9i on Sun/8 CPUs
  • NPS 8250z (224 SPUs/11 TB max) vs. Oracle 9i on Sun/8 CPUs

As you can see, Netezza has a way of finding the oldest hardware around and then comparing it to its latest, greatest NPS. Just like Netezza slogan, [The Power to ]Question Everything™, I suggest you question these benchmarketing reports. Database software is only as capable as the hardware it runs on and when Netezza targets the worst performing and oldest systems out there, they are bound to get some good marketing numbers. If they compete against the latest, greatest database software running on the latest, greatest hardware, sized competitively for the NPS being used, the results are drastically different. I can vouch for that one first hand having done several POCs against Netezza.

One Benchmarketing Claim To Rule Them All

Now, one of my favorite benchmarketing reports is one from Vertica. Michael Stonebraker’s blog post on customer benchmarks contains the following table:

vertica_benchmark_table.png

Take a good look at the Query 2 results. Vertica takes a query running in the current row store from running in 4.5 hours (16,200 seconds) to 1 second for a performance gain of 16,200x. Great googly moogly batman, that is reaching ludicrous speed. Heck, who needs 100x or 400x when you do 16,200x. That surely warrants an explanation of the techniques involved there. It’s much, much more than simply column store vs. row store. It does raise the question (at least to me): why Vertica doesn’t run every query in 1 second. I mean, come on, why doesn’t that 19 minute row store query score better than a 30x gain? Obviously there is a bit of the magic pixie dust going on here with, what I would refer to as “creative solutions” (in reality it is likely just a very well designed projection/materaizied view, but by showing the query and telling us how it was possible would make it less unimpressive [sic]).

What Is Really Going On Here

First of all, you will notice that not one of these benchmarketing claims is against a vendor run system. Each and every one of these claims are against existing customer systems. The main reason for this is that most vendors prohibit benchmark results being published with out prior consent from the vendor in the licensing agreement. Seems the creative types have found that taking the numbers from the existing, production system is not prohibited in the license agreement so they compare that to their latest, greatest hardware/software and execute or supervise the execution of a benchmark on their solution. Obviously this is a one sided apples to bicycles comparison, but quite favorable for bragging rights for the new guy.

I’ve been doing customer benchmarks and proof of concepts (POCs) for almost 5 years at Oracle. I can guarantee you that Netezza has never even come close to getting 10x-100x the performance over Oracle running on a competitive hardware platform. Now I can say that it is not uncommon for Oracle running on a balanced system to perform 10x to 1000x (ok, in extreme cases) over an existing poorly performing Oracle system. All it takes is to have a very unbalanced system with no I/O bandwidth, not be using parallel query, not use compression, poor or no use of partitioning and you have created a springboard for any vendor to look good.

One More Juicy Marketing Tidbit

While searching the Internet for creative marketing reports I have to admit that the crew at ParAccel probably takes the cake (and not in an impressive way). On one of their web pages they have these bullet points (plus a few more uninteresting ones):

  • All operations are done in parallel (A non-parallel DBMS must scan all of the data sequentially)
  • Adaptive compression makes disks faster…

Ok, so I can kinda, sorta see the point that a non-parallel DBMS must do something sequentially…not sure how else it would do it, but then again, I don’t know any enterprise database that is not capable of parallel operations. However, I’m going to need a bit of help on the second point there…how exactly does compression make disks faster? Disks are disks. Whether or not compression is involved has nothing to do with how fast a disk is. Perhaps they mean that compression can increase the logical read rate from a disk given that compression allows more data to be stored in the same “space” on the disk, but that clearly is not what they have written. Reminds me of DATAllegro’s faster-than-wirespeed claims on scan performance. Perhaps these marketing guys should have their numbers and wording validated by some engineers.

Do You Believe In Magic Or Word Games?

Creditable performance claims need to be accounted for and explained. Neil Raden from Hired Brains Research offers guidance for evaluating benchmarks and interpreting market messaging in his paper, Questions to Ask a Data Warehouse Appliance Vendor. I think Neil shares the same opinion of these silly benchmarketing claims. Give his paper a read.

Glenn Fawcetts's Oracle Analysis 101

Glenn Fawcett has put together a nice slide deck entitled “Oracle Analysis 101” that discusses techniques to collect and analyze performance data related to Oracle databases. It’s a good read so check it out. Glenn is also a great resource for Oracle on Sun Niagara 2 Processors (CMT) so be sure to peek at his other blog posts as well.