Archive

Archive for the ‘VLDB’ Category

The Core Performance Fundamentals Of Oracle Data Warehousing – Partitioning

January 25th, 2010

Partitioning is an essential performance feature for an Oracle data warehouse because partition elimination (or partition pruning) generally results in the elimination of a significant amount of table data to be scanned. This results in a need for less system resources and improved query performance. Someone once told me “the fastest I/O is the one that never happens.” This is precisely the reason that partitioning is a must for Oracle data warehouses – it’s a huge I/O eliminator. I frequently refer to partition elimination as the anti-index. An index is used to find a small amount data that is required; partitioning is used to eliminate vasts amounts of data that is not required.

Main Uses For Partitioning

I would classify the main reasons to use partitioning in your Oracle data warehouse into these four areas:

  • Data Elimination
  • Partition-Wise Joins
  • Manageability (Partition Exchange Load, Local Indexes, etc.)
  • Information Lifecycle Management (ILM)

Partitioning Basics

The most common partitioning design pattern found in Oracle data warehouses is to partition the fact tables by range (or interval) on the event date/time column. This allows for partition elimination of all the data not in the desired time window in queries. For example: If I have a fact table that contains point of sale (POS) data, each line item for a given transaction has a time stamp of when the item was scanned. Let’s say this value is stored in column EVENT_TS which is a DATE or TIMESTAMP data type. In most cases it would make sense to partition by range on EVENT_TS using one day partitions. This means every query that uses a predicate filter on EVENT_TS (which should be nearly every one) can eliminate significant amounts of data that is not required to satisfy the query predicate.  If you want to look at yesterday’s sales numbers, there is no need to bring back rows from last week or last month!

Subpartitioning Options

Depending on the schema design of your data warehouse you may also chose to subpartition a table. This allows one to further segment a table to allow for even more data elimination or it can allow for partition-wise joins which allow for reduced usage of CPU and memory resources by minimizing the amount of data exchanged between parallel execution server processes. In third normal form (3NF) schemas it is very beneficial to use hash partitioning or subpartitioning to allow for partition-wise joins (see Oracle Parallel Execution: Interconnect Myths And Misunderstandings) for this exact reason. Dimensional models (star schemas) may also benefit from hash subpartitioning and partition-wise joins. Generally it is best to hash subpartition on a join key column to a very large dimension, like CUSTOMER, so that a partition-wise join will be used between the fact table and the large dimension table.

Manageability

Managing large objects can be challenging for a variety of reasons which is why Oracle Partitioning allows for many operations to be done at a global or partition (or subpartition) level.  This makes it much easier to deal with tables or indexes of large sizes.  It also is transparent to applications so the SQL that runs against a non-partitioned object will run as-is against a partitioned object.  Some of the key features include:

  • Partition Exchange Load – Data can be loaded “out of line” and exchanged into a partitioned table.
  • Local Indexes – It takes much less time to build local indexes than global indexes.
  • Compression – Can be applied at the segment level so it’s possible to have a mix of compressed and non-compressed partitions.
  • Segment Moves/Rebuilds/Truncates/Drops – Each partition (or subpartition) is a segment and can be operated on individually and independently of the other partitions in the table.
  • Information Lifecycle Management (ILM) – Partitioning allows implementation of an ILM strategy.

Summary

I’d classify partitioning as a “must have” for Oracle data warehouses for both the performance and manageability reasons described above. Partitioning should lower query response time as well as resource utilization do to “smart” data access (only go after the data the query needs). There are additional partitioning design patterns as well and the Oracle documentation contains descriptions of them as well as examples.

Oracle Documentation References:

Post to Twitter Tweet This Post

Data Warehousing, Oracle, Performance, VLDB

The Core Performance Fundamentals Of Oracle Data Warehousing – Table Compression

January 19th, 2010

Editor’s note: This blog post does not cover Exadata Hybrid Columnar Compression.

The first thing that comes to most people’s mind when database table compression is mentioned is the savings it yields in terms of disk space. While reducing the footprint of data on disk is relevant, I would argue it is the lesser of the benefits for data warehouses. Disk capacity is very cheap and generally plentiful, however, disk bandwidth (scan speed) is proportional to the number of spindles, no mater what the disk capacity and thus is more expensive. Table compression reduces the footprint on the disk drives that a given data set occupies so the amount of physical data that must be read off the disk platters is reduced when compared to the uncompressed version. For example, if 4000 GB of raw data can compress to 1000 GB, it can be read off the same disk drives 4X as fast because it is reading and transferring 1/4 of the data off the spindles (relative to the uncompressed size). Likewise, table compression allows for the database buffer cache to contain more data without having to increase the memory allocation because more rows can be stored in a compressed block/page compared to an uncompressed block/page.

Row major table compression comes in two flavors with the Oracle database: BASIC and OLTP.  In 11.1 these were also known by the key phrases COMPRESS or COMPRESS FOR DIRECT_LOAD OPERATIONS and COMPRESS FOR ALL OPERATIONS.  The BASIC/DIRECT_LOAD compression has been part of the Oracle database since version 9 and ALL OPERATIONS/OLTP compression was introduced in 11.1 with the Advanced Compression option.

Oracle row major table compression works by storing the column values for a given block in a symbol table at the beginning of the block. The more repeated values per block, even across columns, the better the compression ratio. Sorting data can increase the compression ratio as ordering the data will generally allow more repeat values per block. Specific compression ratios and gains from sorting data are very data dependent but compression ratios are generally between 2x and 4x.

Compression does add some CPU overhead when direct path loading data, but there is no measurable performance overhead when reading data as the Oracle database can operate on compressed blocks directly without having to first uncompress the block. The additional CPU required when bulk loading data is generally well worth the down wind gains for data warehouses. This is because most data in a well designed data warehouse is write once, read many times. Insert only and infrequently modified tables are ideal candidates for BASIC compression.  If the tables have significant DML performed against them, then OLTP compression would be advised (or no compression).

Given that most Oracle data warehouses that I have seen are constrained by I/O bandwidth (see Balanced Hardware Configuration) it is highly recommended to leverage compression so the logical table scan rate can increase proportionally to the compression ratio.  This will result in faster table and partition scans on the same hardware.

Oracle Documentation References:

Post to Twitter Tweet This Post

Data Warehousing, Oracle, Performance, VLDB

The Core Performance Fundamentals Of Oracle Data Warehousing – Balanced Hardware Configuration

December 22nd, 2009

If you want to build a house that will stand the test of time, you need to build on a solid foundation. The same goes for architecting computer systems that run databases. If the underlying hardware is not sized appropriately it will likely lead to people blaming software. All too often I see data warehouse systems that are poorly architected for the given workload requirements. I frequently tell people, “you can’t squeeze blood from a turnip“, meaning if the hardware resources are not there for the software to use, how can you expect the software to scale?

Undersizing data warehouse systems has become an epidemic with open platforms – platforms that let you run on any brand and configuration of hardware. This problem has been magnified over time as the size of databases have grown significantly, and generally outpacing the experience of those managing them. This has caused the “big three” database vendors to come up with suggested or recommended hardware configurations for their database platforms:

Simply put, the reasoning behind those initiatives was to help customers architect systems that are well balanced and sized appropriately for the size of their data warehouse.

Balanced Hardware Configurations

The foundation for a well performing data warehouse (or any system for that matter) is the hardware that it runs on. There are three main hardware resources to consider when sizing your data warehouse hardware. Those are:

  1. Number of CPUs
  2. Number of storage devices (HDDs or SSDs)
  3. I/O bandwidth between CPUs and storage devices

NB: I’ve purposely left off memory (RAM) as most systems are pretty well sized at 2GB or 4GB per CPU core these days.

A balanced system has the following characteristics:

dd459146image013en-us.jpg

As you can see, each of the three components are sized proportionally to each other. This allows for the max system throughput capacity as no single resource will become the bottleneck before any other. This was one of the critical design decisions that went into the Oracle Database Machine.

Most DBAs and System Admins know what the disk capacity numbers are for their systems, but when it comes to I/O bandwidth or scan rates, most are unaware of what the system is capable of in theory, let alone in practice. Perhaps I/O bandwidth utilization should be included in the system metrics that are collected for your databases. You do collect system metrics, right?

There are several “exchanges” that data must flow through from storage devices to host CPUs, many of which could become bottlenecks. Those include:

  • Back-end Fibre Channel loops (the fibre between the drive shelves and the storage array server processor)
  • Front-end Fibre Channel ports
  • Storage array server processors (SP)
  • Host HBAs

One should understand the throughput capacity of each of these components to ensure that one (or more) of them do not restrict the flow of data to the CPUs prematurely.

Unbalanced Hardware Configurations

All too frequently systems are not architected as balanced systems and the system ends up being constrained in one of the following three scenarios:

dd459146image011en-us.jpg

From the production systems that I have seen, the main deficiency is in I/O bandwidth (both I/O Channel and HDD). I believe there are several reasons for this. First, too many companies capacity plan for their data warehouse based on the size the data occupies on disk alone. That is, they purchase the number of HDDs for the system based on the drive capacity, not on the I/O bandwidth requirement. Think of it like this: If you were to purchase 2 TB of mirrored disk capacity (4 TB total) would you rather purchase 28 x 146 GB drives or 14 x 300 GB drives (or even 4 x 1 TB drives)? You may ask: Well, what is the difference (other than price); in each case you have the same net capacity, correct? Indeed, both configurations do have the same capacity, but I/O bandwidth (how fast you can read data off the HDDs) is proportional to the number of HDDs, not the capacity. Thus it should be slightly obvious then that 28 HDDs can deliver 2X the disk I/O bandwidth that 14 HDDs can. This means that it will take 2X as long to read the same amount of data off of 14 HDDs as 28 HDDs.

Unfortunately what tends to happen is that the bean counter types will see only two things:

  1. The disk capacity (space) you want to purchase (or the capacity that is required)
  2. The price per MB/GB/TB

This is where someone worthy of the the title systems architect needs to stand up and explain the concept of I/O bandwidth and the impact it has on data warehouse performance (your systems architect does know this, correct?). This is generally a difficult discussion because I/O bandwidth is not a line item on a purchase order, it is a derived metric that requires both thought and engineering (which means someone had to do some thinking about the requirements for this system!).

Summary

When sizing the hardware for your data warehouse consider your workload and understand following (and calculate numbers for them!):

  1. What rate (in MB/GB per second) can the CPUs consume data?
  2. What rate can storage devices produce data (scan rate)?
  3. What rate can the data be delivered from the storage array(s) to the host HBAs?

If you are unable to answer these questions in theory then you need to sit down and do some calculations. Then you need to use some micro benchmarks (like Oracle ORION) and prove out those calculations. This will give you the “speed limit” and an metric by which you can measure your database workload against. All computer systems much obey the laws of physics! There is no way around that.

Additional Readings

Kevin Closson has several good blog posts on related topics including:

as well as numerous others.

Oracle Documentation References:

Post to Twitter Tweet This Post

Data Warehousing, Oracle, Performance, VLDB

The Core Performance Fundamentals Of Oracle Data Warehousing – Introduction

December 14th, 2009

At the 2009 Oracle OpenWorld Unconference back in October I lead a chalk and talk session entitled The Core Performance Fundamentals Of Oracle Data Warehousing. Since this was a chalk and talk I spared the audience any powerpoint slides but I had several people request that make it into a presentation so they could share it with others. After some thought, I decided that a series of blog posts would probably be a better way to share this information, especially since I tend to use slides as a speaking outline, not a condensed version of a white paper. This will be the first of a series of posts discussing what I consider to be the key features and technologies behind well performing Oracle data warehouses.

Introduction

As an Oracle database performance engineer who has done numerous customer data warehouse benchmarks and POCs over the past 5+ years, I’ve seen many data warehouse systems that have been plagued with problems on nearly every DBMS commonly used in data warehousing. Interestingly enough, many of these systems were facing many of the same problems. I’ve compiled a list of topics that I consider to be key features and/or technologies for Oracle data warehouses:

Core Performance Fundamental Topics

In the upcoming posts, I’ll deep dive into each one of these topics discussing why these areas are key for a well performing Oracle data warehouse. Stay tuned…

Post to Twitter Tweet This Post

Data Warehousing, Oracle, Performance, VLDB

Oracle OpenWorld 2009: The Real-World Performance Group

July 20th, 2009

Even though Oracle OpenWorld 2009 is a few months away, I thought I would take a moment to mention that the Oracle Real-World Performance Group will again be hosting three sessions. Hopefully you are no stranger to our Oracle database performance sessions and this year we have what I think will be a very exciting and enlightening session: The Terabyte Hour with the Real-World Performance Group. If you are the slightest bit interested in seeing just how fast the Oracle Database Machine really is and how it can devour flat files in no time, rip through and bend data at amazing speeds, this is the session for you. All the operations will be done live for you to observe. No smoke. No mirrors. Pure Exadata performance revealed.

Session ID: S311237
Session Title: Real-World Database Performance Roundtable
Session Abstract: This session is a panel discussion including Oracle’s Real-World Performance Group and other invited performance experts. To make the hour productive, attendees need to write their questions on postcards and hand them to the panel at the beginning of the session. The questions should stick to the subject matter of real-world database performance. The panel members look forward to meeting you and answering your questions.
   
Session ID: S311239
Session Title: The Terabyte Hour with the Real-World Performance Group
Session Abstract: Last year at Oracle OpenWorld, Oracle launched the Oracle Database Machine, a complete package of software, servers, and storage with the power to tackle large-scale business intelligence problems immediately and scale linearly as your data warehouse grows. In this session, Oracle’s Real-World Performance Group demonstrates how to use an Oracle Database Machine to load, transform, and query a 1-terabyte database in less than an hour. The demonstration shows techniques for exploiting full database parallelism in a simple but optimal manner.
   
Session ID: S311238
Session Title: Current Trends in Real-World Database Performance
Session Abstract: The year 2009 has been an exciting one for Oracle’s Real-World Performance Group. The group has been challenged by bigger databases, new performance challenges, and now the Oracle Database Machine with Oracle Exadata Storage Server. This session focuses on some of the real-world performance ideas and solutions that have worked over the last year, including performance design philosophies, best practices, and a few tricks and tips.

Post to Twitter Tweet This Post

Exadata, Oracle, Performance, VLDB

Oracle Parallel Execution: Interconnect Myths And Misunderstandings

July 6th, 2009

A number of weeks back I had come across a paper/presentation by Riyaj Shamsudeen entitled Battle of the Nodes: RAC Performance Myths (avaiable here). As I was looking through it I saw one example that struck me as very odd (Myth #3 – Interconnect Performance) and I contacted him about it. After further review Riyaj commented that he had made a mistake in his analysis and offered up a new example. I thought I’d take the time to discuss this as parallel execution seems to be one of those areas where many misconceptions and misunderstandings exist.

The Original Example

I thought I’d quickly discuss why I questioned the initial example. The original query Riyaj cited is this one:

select /*+ full(tl) parallel (tl,4) */
       avg (n1),
       max (n1),
       avg (n2),
       max (n2),
       max (v1)
from   t_large tl;

As you can see this is a very simple single table aggregation without a group by. The reason that I questioned the validity of this example in the context of interconnect performance is that the parallel execution servers (parallel query slaves) will each return exactly one row from the aggregation and then send that single row to the query coordinator (QC) which will then perform the final aggregation. Given that, it would seem impossible that this query could cause any interconnect issues at all.

Riyaj’s Test Case #1

Recognizing the original example was somehow flawed, Riyaj came up with a new example (I’ll reference as TC#1) which consisted of the following query:

select /*+ parallel (t1, 8,2) parallel (t2, 8, 2)  */
       min (t1.customer_trx_line_id + t2.customer_trx_line_id),
       max (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.quantity_ordered + t2.quantity_ordered),
       max (t1.attribute_category),
       max (t2.attribute1),
       max (t1.attribute2)
from   (select *
        from   big_table
        where  rownum <= 100000000) t1,
       (select *
        from   big_table
        where  rownum <= 100000000) t2
where  t1.customer_trx_line_id = t2.customer_trx_line_id;

The execution plan for this query is:

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |     1 |   249 |       |  2846K  (4)| 01:59:01 |        |      |            |
|   1 |  SORT AGGREGATE           |           |     1 |   249 |       |            |          |        |      |            |
|*  2 |   HASH JOIN               |           |   100M|    23G|   762M|  2846K  (4)| 01:59:01 |        |      |            |
|   3 |    VIEW                   |           |   100M|    10G|       |  1214K  (5)| 00:50:46 |        |      |            |
|*  4 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|   5 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|   6 |       PX SEND QC (RANDOM) | :TQ10000  |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | P->S | QC (RAND)  |
|*  7 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q1,00 | PCWC |            |
|   8 |         PX BLOCK ITERATOR |           |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| BIG_TABLE |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWP |            |
|  10 |    VIEW                   |           |   100M|    12G|       |  1214K  (5)| 00:50:46 |        |      |            |
|* 11 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|  12 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|  13 |       PX SEND QC (RANDOM) | :TQ20000  |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | P->S | QC (RAND)  |
|* 14 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q2,00 | PCWC |            |
|  15 |         PX BLOCK ITERATOR |           |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWC |            |
|  16 |          TABLE ACCESS FULL| BIG_TABLE |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."n1"="T2"."n1")
   4 - filter(ROWNUM<=100000000)
   7 - filter(ROWNUM<=100000000)
  11 - filter(ROWNUM<=100000000)
  14 - filter(ROWNUM<=100000000)

This is a rather synthetic query but there are a few things that I would like to point out. First, this query uses a parallel hint with 3 values representing table/degree/instances, however instances has been deprecated (see 10.2 parallel hint documentation). In this case the DOP is calculated by degree * instances or 16, not DOP=8 involving 2 instances. Note that the rownum filter is causing all the rows from the tables to be sent back to the QC for the COUNT STOPKEY operation thus causing the execution plan to serialize, denoted by the P->S in the IN-OUT column.

Riyaj had enabled sql trace for the QC and the TKProf output is such:

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=152 pr=701158 pw=701127 time=1510221226 us)
98976295   HASH JOIN  (cr=152 pr=701158 pw=701127 time=1244490336 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=200279054 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200279023 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=100270084 us)
      0       PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=300298770 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200298726 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=200279954 us)
      0       PX SEND QC (RANDOM) :TQ20000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)

Note that the Rows column contains zeros for many of the row sources because this trace is only for the QC, not the slaves, and thus only QC rows will show up in the trace file. Something to be aware of if you decide to use sql trace with parallel execution.

Off To The Lab: Myth Busting Or Bust!

I wanted to take a query like TC#1 and run it in my own environment so I could do more monitoring of it. Given the alleged myth had to do with interconnect traffic of cross-instance (inter-node) parallel execution, I wanted to be certain to gather the appropriate data. I ran several tests using a similar query on a similar sized data set (by row count) as the initial example. I ran all my experiments on a Oracle Real Application Clusters version 11.1.0.7 consisting of eight nodes, each with two quad-core CPUs. The interconnect is InfiniBand and the protocol used is RDS (Reliable Datagram Sockets).

Before I get into the experiments I think it is worth mentioning that Oracle's parallel execution (PX), which includes Parallel Query (PQ), PDML & PDDL, can consume vast amounts of resources. This is by design. You see, the idea of Oracle PX is to dedicate a large amount of resources (processes) to a problem by breaking it up into many smaller pieces and then operate on those pieces in parallel. Thus the more parallelism that is used to solve a problem, the more resources it will consume, assuming those resources are available. That should be fairly obvious, but I think it is worth stating.

For my experiments I used a table that contains just over 468M rows.

Below is my version of TC#1. The query is a self-join on a unique key and the table is range partitioned by DAY_KEY into 31 partitions. Note that I create a AWR snapshot immediately before and after the query.

exec dbms_workload_repository.create_snapshot

select /* &amp;&amp;1 */
       /*+ parallel (t1, 16) parallel (t2, 16) */
       min (t1.bsns_unit_key + t2.bsns_unit_key),
       max (t1.day_key + t2.day_key),
       avg (t1.day_key + t2.day_key),
       max (t1.bsns_unit_typ_cd),
       max (t2.curr_ind),
       max (t1.load_dt)
from   dwb_rtl_trx t1,
       dwb_rtl_trx t2
where  t1.trx_nbr = t2.trx_nbr;

exec dbms_workload_repository.create_snapshot

Experiment Results Using Fixed DOP=16

I ran my version of TC#1 across a varying number of nodes by using Oracle services (instance_groups and parallel_instance_group have been deprecated in 11g), but kept the DOP constant at 16 for all the tests. Below is a table of the experiment results.

Nodes
Elapsed Time
SQL Monitor Report
AWR Report
AWR SQL Report
1
00:04:54.12
2
00:03:55.35
4
00:02:59.24
8
00:02:14.39

Seemingly contrary to what many people would probably guess, the execution times got better the more nodes that participated in the query even though the DOP constant throughout each of tests.

Measuring The Interconnect Traffic

One of the new additions to the AWR report in 11g was the inclusion of interconnect traffic by client. This section is near the bottom of the report and looks like such:
Interconnect Throughput By Client
This allows the PQ message traffic to be tracked, whereas in prior releases it was not.

Even though AWR contains the throughput numbers (as in megabytes per second) I thought it would be interesting to see how much data was being transferred, so I used the following query directly against the AWR data. I put a filter predicate on to return only where there DIFF_RECEIVED_MB >= 10MB so the instances that were not part of the execution are filtered out, as well as the single instance execution.

break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 910 and 917
        order by snap_id,
                 instance_number)
where  snap_id in (911, 913, 915, 917) and
       diff_received_mb &gt;= 10
/

SNAP_ID    INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
       913               1        11604            10688
                         2        10690            11584
**********                 ------------ ----------------
sum                               22294            22272

       915               1         8353             8350
                         2         8133             8418
                         3         8396             8336
                         4         8514             8299
**********                 ------------ ----------------
sum                               33396            33403

       917               1         5033             4853
                         2         4758             4888
                         3         4956             4863
                         4         5029             4852
                         5         4892             4871
                         6         4745             4890
                         7         4753             4889
                         8         4821             4881
**********                 ------------ ----------------
sum                               38987            38987

As you can see from the data, the more nodes that were involved in the execution, the more interconnect traffic there was, however, the execution times were best with 8 nodes.

Further Explanation Of Riyaj's Issue

If you read Riyaj's post, you noticed that he observed worse, not better as I did, elapsed times when running on two nodes versus one. How could this be? It was noted in the comment thread of that post that the configuration was using Gig-E as the interconnect in a Solaris IPMP active-passive configuration. This means the interconnect speeds would be capped at 128MB/s (1000Mbps), the wire speed of Gig-E. This is by all means is an inadequate configuration to use cross-instance parallel execution.

There is a whitepaper entitled Oracle SQL Parallel Execution that discusses many of the aspects of Oracle's parallel execution. I would highly recommend reading it. This paper specifically mentions:

If you use a relatively weak interconnect, relative to the I/O bandwidth from the server to the storage configuration, then you may be better of restricting parallel execution to a single node or to a limited number of nodes; inter-node parallel execution will not scale with an undersized interconnect.

I would assert that this is precisely the root cause (insufficient interconnect bandwidth for cross-instance PX) behind the issues that Riyaj observed, thus making his execution slower on two nodes than one node.

The Advantage Of Hash Partitioning/Subpartitioning And Full Partition-Wise Joins

At the end of my comment on Riyaj's blog, I mentioned:

If a DW frequently uses large table to large table joins, then hash partitioning or subpartitioning would yield added gains as partition-wise joins will be used.

I thought that it would be both beneficial and educational to extend TC#1 and implement hash subpartitioning so that the impact could be measured on both query elapsed time and interconnect traffic. In order for a full partition-wise join to take place, the table must be partitioned/subpartitioned on the join key column, so in this case I've hash subpartitioned on TRX_NBR. See the Oracle Documentation on Partition-Wise Joins for a more detailed discussion on PWJ.

Off To The Lab: Partition-Wise Joins

I've run through the exact same test matrix with the new range/hash partitioning model and below are the results.

Nodes
Elapsed Time
SQL Monitor Report
AWR Report
AWR SQL Report
1
00:02:42.41
2
00:01:37.29
4
00:01:12.82
8
00:01:05.04

As you can see by the elapsed times, the range/hash partitioning model with the full partition-wise join has decreased the overall execution time by around a factor of 2X compared to the range only partitioned version.

Now let's take a look at the interconnect traffic for the PX messages:

break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 1041 and 1048
        order by snap_id,
                 instance_number)
where  snap_id in (1042,1044,1046,1048) and
       diff_received_mb &gt;= 10
/
no rows selected

Hmm. No rows selected?!? I had previously put in the predicate "DIFF_RECEIVED_MB >= 10MB" to filter out the nodes that were not participating in the parallel execution. Let me remove that predicate rerun the query.

   SNAP_ID INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
      1042               1            8                6
                         2            2                3
                         3            2                3
                         4            2                3
                         5            2                3
                         6            2                3
                         7            2                3
                         8            2                3
**********                 ------------ ----------------
sum                                  22               27

      1044               1            7                7
                         2            3                2
                         3            2                2
                         4            2                2
                         5            2                2
                         6            2                2
                         7            2                2
                         8            2                2
**********                 ------------ ----------------
sum                                  22               21

      1046               1            1                2
                         2            1                2
                         3            1                2
                         4            3                1
                         5            1                1
                         6            1                1
                         7            1                1
                         8            1                1
**********                 ------------ ----------------
sum                                  10               11

      1048               1            6                5
                         2            1                2
                         3            3                2
                         4            1                2
                         5            1                2
                         6            1                2
                         7            1                2
                         8            1                2
**********                 ------------ ----------------
sum                                  15               19

Wow, there is almost no interconnect traffic at all. Let me verify with the AWR report from the 8 node execution.

The AWR report confirms that there is next to no interconnect traffic for the PWJ version of TC#1. The reason for this is that since the table is hash subpartitoned on the join column each of the subpartitions can be joined to each other minimizing the data sent between parallel execution servers. If you look at the execution plan (see the AWR SQL Report) from the first set of experiments you will notice that the broadcast method for each of the tables is HASH but in the range/hash version of TC#1 there is no broadcast at all for either of the two tables. The full partition-wise join behaves logically the same way that a shared-nothing database would; each of the parallel execution servers works on its partition which does not require data from any other partition because of the hash partitioning on the join column. The main difference is that in a shared-nothing database the data is physically hash distributed amongst the nodes (each node contains a subset of all the data) where as all nodes in a Oracle RAC database have access to the all the data.

Parting Thoughts

Personally I see no myth about cross-instance (inter-node) parallel execution and interconnect traffic, but frequently I see misunderstandings and misconceptions. As shown by the data in my experiment, TC#1 (w/o hash subpartitioning) running on eight nodes is more than 2X faster than running on one node using exactly the same DOP. Interconnect traffic is not a bad thing as long as the interconnect is designed to support the workload. Sizing the interconnect is really no different than sizing any other component of your cluster (memory/CPU/disk space/storage bandwidth). If it is undersized, performance will suffer. Depending on the number and speed of the host CPUs and the speed and bandwidth of the interconnect, your results may vary.

By hash subpartioning the table the interconnect traffic was all but eliminated and the query execution times were around 2X faster than the non-hash subpartition version of TC#1. This is obviously a much more scalable solution and one of the main reasons to leverage hash (sub)partitioning in a data warehouse.

Post to Twitter Tweet This Post

Data Warehousing, Oracle, Parallel Execution, Performance, VLDB

Facebook: Hive – A Petabyte Scale Data Warehouse Using Hadoop

June 10th, 2009

Today, June 10th, marks the Yahoo! Hadoop Summit ‘09 and the crew at Facebook have a writeup on the Facebook Engineering page entitled: Hive – A Petabyte Scale Data Warehouse Using Hadoop.

I found this an very interesting read given some of the Hadoop/MapReduce comments from David J. DeWitt and Michael Stonebraker as well as their SIGMOD 2009 paper, A Comparison of Approaches to Large-Scale Data Analysis. Now I’m not about to jump into this whole dbms-is-better-than-mapreduce argument but I found Facebook’s story line interesting:

When we started at Facebook in 2007 all of the data processing infrastructure was built around a data warehouse built using a commercial RDBMS. The data that we were generating was growing very fast – as an example we grew from a 15TB data set in 2007 to a 2PB data set today. The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to process and the situation was just getting worse with every passing day. We had an urgent need for infrastructure that could scale along with our data and it was at that time we then started exploring Hadoop as a way to address our scaling needs.

[The] Hive/Hadoop cluster at Facebook stores more than 2PB of uncompressed data and routinely loads 15 TB of data daily

Wow, 2PB of uncompressed data and growing at around 15TB daily. A part of me wonders how much value there is in 2PB of data or if companies are suffering from OCD when it comes to data. Either way it’s interesting to see how much data is being generated/collected and how engineers are dealing with it.

Post to Twitter Tweet This Post

Data Warehousing, VLDB

The Impact Of Good Table And Query Design

March 19th, 2009

There are many ways to design tables/schemas and many ways to write SQL queries that execute against those tables/schemas. Some designs are better than others for various reasons, however, I think that frequently people underestimate the power of SQL (for both “good” and “evil”). All too often in data warehouses, I see tables designed for one specific report, or a very select few reports. These tables frequently resemble Microsoft Excel Spreadsheets (generally Pivot Tables), not good Dimensional (Star Schema) or Third Normal Form (3NF) schema design. The problem with such designs is that it severely limits the usefulness of that data, as queries that were not known at the time of design often time become problematic. The following is a simple one table example, derived from a field experience in which I discuss two table designs and provide the SQL queries to answer a question the business is seeking.

The Business Question

First lets start with the business question for which the answer is being sought.
What customers meet the following criteria:

  • do not own PRODUCT1 or PRODUCT2 but have downloaded SOFTWARE
  • do not own PRODUCT2 and it has been more than 90 days between SOFTWARE download and their purchase of PRODUCT1

Version 1: The Column Based (Pivot) Table Design

For Version 1, there is a single row for each customer and each attribute has its own column. In this case there are 4 columns, each representing the most recent activity date for that product.

SQL> desc column_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 SOFTWARE_MAC_RECENCY_TS                            DATE
 SOFTWARE_WIN_RECENCY_TS                            DATE
 PRODUCT1_RECENCY_TS                                DATE
 PRODUCT2_RECENCY_TS                                DATE

SQL> select * from column_tab;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> select customer_id
  2  from   column_tab
  3  where  product2_recency_ts is null and
  4         (((software_win_recency_ts is not null or
  5            software_mac_recency_ts is not null) and
  6           product1_recency_ts is null) or
  7          ((software_win_recency_ts - product1_recency_ts) > 90 or
  8           (software_mac_recency_ts - product1_recency_ts) > 90));

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4293700422

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |     2 |    42 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COLUMN_TAB |     2 |    42 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("PRODUCT2_RECENCY_TS" IS NULL AND ("PRODUCT1_RECENCY_TS"
              IS NULL AND ("SOFTWARE_MAC_RECENCY_TS" IS NOT NULL OR
              "SOFTWARE_WIN_RECENCY_TS" IS NOT NULL) OR
              "SOFTWARE_MAC_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90 OR
              "SOFTWARE_WIN_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90))

As you can see, the query construct to answer the business question is straight forward and requires just one pass over the table.

Version 2: The Row Based (Unpivot) Table, Take 1

In Version 2, there is a single row (tuple) which tracks the customer, product and the recency date. Unlike Version 1, none of the columns can be NULL.

SQL> desc row_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 RECENCY_TS                                NOT NULL DATE
 PRODUCT                                   NOT NULL VARCHAR2(32)

SQL> select * from row_tab;

CUSTOMER_ID RECENCY_TS PRODUCT
----------- ---------- --------------------------------
        100 2009-03-17 SOFTWARE_MAC
        200 2009-03-17 SOFTWARE_MAC
        300 2009-03-17 SOFTWARE_MAC
        500 2009-03-17 SOFTWARE_MAC
        400 2009-03-17 SOFTWARE_WIN
        100 2008-11-17 PRODUCT1
        200 2009-01-16 PRODUCT1
        300 2008-10-08 PRODUCT1
        400 2008-11-07 PRODUCT1
        300 2009-02-25 PRODUCT2

10 rows selected.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id not in (select customer_id
  9                               from   row_tab
 10                               where  product in ('PRODUCT1', 'PRODUCT2')) and
 11         a.customer_id = b.customer_id
 12  union
 13  select a.customer_id
 14  from   row_tab a,
 15         (select customer_id,
 16                 product,
 17                 recency_ts
 18          from   row_tab
 19          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
 20  where  a.customer_id not in (select customer_id
 21                               from   row_tab
 22                               where  product = 'PRODUCT2') and
 23         a.customer_id = b.customer_id and
 24         (a.product = 'PRODUCT1' and
 25          b.recency_ts - a.recency_ts > 90);

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3517586312

---------------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |    11 |   368 |    22  (60)| 00:00:01 |
|   1 |  SORT UNIQUE          |         |    11 |   368 |    22  (60)| 00:00:01 |
|   2 |   UNION-ALL           |         |       |       |            |          |
|*  3 |    HASH JOIN ANTI     |         |    10 |   310 |    10  (10)| 00:00:01 |
|*  4 |     HASH JOIN         |         |    11 |   187 |     7  (15)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL| ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|   6 |      TABLE ACCESS FULL| ROW_TAB |    10 |    30 |     3   (0)| 00:00:01 |
|*  7 |     TABLE ACCESS FULL | ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|*  8 |    HASH JOIN ANTI     |         |     1 |    58 |    10  (10)| 00:00:01 |
|*  9 |     HASH JOIN         |         |     1 |    44 |     7  (15)| 00:00:01 |
|* 10 |      TABLE ACCESS FULL| ROW_TAB |     4 |    88 |     3   (0)| 00:00:01 |
|* 11 |      TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|* 12 |     TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   4 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   5 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   7 - filter("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2')
   8 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   9 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       filter("RECENCY_TS"-"A"."RECENCY_TS">90)
  10 - filter("A"."PRODUCT"='PRODUCT1')
  11 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
  12 - filter("PRODUCT"='PRODUCT2')

Version 2, Take 2

The way the query is written in Version 2, Take 1, it requires six accesses to the table. Partly this is because it uses a UNION. In this case the UNION can be removed and replaced with an OR branch.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id = b.customer_id and
  9         ((a.customer_id not in (select customer_id
 10                               from   row_tab
 11                               where  product in ('PRODUCT1', 'PRODUCT2')))
 12         or
 13         ((a.customer_id not in (select customer_id
 14                               from   row_tab
 15                               where  product = 'PRODUCT2') and
 16         (a.product = 'PRODUCT1' and
 17          b.recency_ts - a.recency_ts > 90))))
 18  /

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3327813549

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    44 |     7  (15)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|*  2 |   HASH JOIN         |         |    11 |   484 |     7  (15)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
|*  6 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2'))
              OR  NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B2) AND
              "A"."PRODUCT"='PRODUCT1' AND "RECENCY_TS"-"A"."RECENCY_TS">90)
   2 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   3 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   5 - filter("CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR
              "PRODUCT"='PRODUCT2'))
   6 - filter("PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B1)

This rewrite brings the table accesses down to four from six, so progress is being made, but I think we can do even better.

Version 2, Take 3

SQL is a very powerful language and there is usually more than one way to structure a query. Version 2, Take 1 uses a very literal translation of the business question and Take 2 just does a mild rewrite changing the UNION to an OR. In Version 2, Take 3, I am going to leverage some different, but very powerful functionality to yield the same results.

SQL> -- COLUMN_TAB can be expressed using ROW_TAB with MAX + CASE WHEN + GROUP BY:
SQL> select   customer_id,
  2           max (case
  3                   when product = 'SOFTWARE_MAC'
  4                      then recency_ts
  5                end) software_mac_recency_ts,
  6           max (case
  7                   when product = 'SOFTWARE_WIN'
  8                      then recency_ts
  9                end) software_win_recency_ts,
 10           max (case
 11                   when product = 'PRODUCT1'
 12                      then recency_ts
 13                end) product1_recency_ts,
 14           max (case
 15                   when product = 'PRODUCT2'
 16                      then recency_ts
 17                end) product2_recency_ts
 18  from     row_tab
 19  group by customer_id;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- The original query can be expressed as follows:
SQL> select customer_id
  2  from   (select   customer_id,
  3                   max (case
  4                           when product = 'SOFTWARE_MAC'
  5                              then recency_ts
  6                        end) software_mac_recency_ts,
  7                   max (case
  8                           when product = 'SOFTWARE_WIN'
  9                              then recency_ts
 10                        end) software_win_recency_ts,
 11                   max (case
 12                           when product = 'PRODUCT1'
 13                              then recency_ts
 14                        end) product1_recency_ts,
 15                   max (case
 16                           when product = 'PRODUCT2'
 17                              then recency_ts
 18                        end) product2_recency_ts
 19          from     row_tab
 20          group by customer_id)
 21  where  product2_recency_ts is null and
 22         (((software_win_recency_ts is not null or
 23            software_mac_recency_ts is not null) and
 24           product1_recency_ts is null) or
 25          ((software_win_recency_ts - product1_recency_ts) > 90 or
 26           (software_mac_recency_ts - product1_recency_ts) > 90)
 27         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 825621652

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|   2 |   HASH GROUP BY     |         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE "PRODUCT" WHEN 'PRODUCT2' THEN "RECENCY_TS" END
              ) IS NULL AND ((MAX(CASE "PRODUCT" WHEN 'SOFTWARE_WIN' THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE "PRODUCT" WHEN
              'SOFTWARE_MAC' THEN "RECENCY_TS" END ) IS NOT NULL) AND MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END ) IS NULL OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_WIN' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90 OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_MAC' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90))

Rewriting the query as a CASE WHEN with a GROUP BY not only cleaned up the SQL, it also resulted in a single pass over the table. Version 2, Take 3 reduces the table access from four to one!

Version 2, Take 4: The PIVOT operator in 11g

In 11g the PIVOT operator was introduced and can simplify the query even more.

SQL> -- In 11g the PIVOT operator can be used, so COLUMN_TAB can be expressed as:
SQL> select *
  2  from row_tab
  3  pivot (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts));

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- Using PIVOT the original query can be expressed as:
SQL> select customer_id
  2  from   row_tab
  3  pivot  (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts))
  8  where  product2_recency_ts is null and
  9         (((software_win_recency_ts is not null or
 10            software_mac_recency_ts is not null) and
 11           product1_recency_ts is null) or
 12          ((software_win_recency_ts - product1_recency_ts) > 90 or
 13           (software_mac_recency_ts - product1_recency_ts) > 90)
 14         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3127820873

--------------------------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER              |         |       |       |            |          |
|   2 |   HASH GROUP BY PIVOT|         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL | ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE  WHEN ("PRODUCT"='PRODUCT2') THEN "RECENCY_TS"
              END ) IS NULL AND ((MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE  WHEN
              ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END ) IS NOT NULL) AND
              MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END ) IS NULL
              OR MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90 OR
              MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90))

The Big Picture

One thing that I did not touch on is the flexibility of the ROW_TAB design when it comes to evolution. Any number of products can be added without making any modifications to the loading process. In order to do this with the COLUMN_TAB a new column must be added for each new product. The other major difference between the two table designs is that ROW_TAB is insert only while COLUMN_TAB must be updated if the customer exists. Generally one wants to avoid updated in a data warehouse as 1) old data is usually over written and 2) updates are more expensive than inserts.

The other major thing I won’t discuss in detail is how to partition or index (if required) COLUMN_TAB. Think about this. With ROW_TAB it is very straight forward.

Summary

There are many ways to design tables and write queries. Some of them work well, some do not. Some appear impossible at first, only to appear more simple later. Literal translation of a business question into SQL is usually far from optimal. One needs to think about the question being asked, the shape of the data, and the options available to solve that problem as well as the trade offs of those solutions. Remember: table definitions do not have to look like Spreadsheets. Generally only the output of a query needs to.

Don’t get stuck in SQL-92. It is the year 2009. You should be writing your SQL using the constructs that are provided. Often times very complex data transformations can be done with just SQL. Leverage this power.

All experiments performed on 11.1.0.7

Post to Twitter Tweet This Post

11gR1, Data Warehousing, Execution Plans, Oracle, SQL Tuning, VLDB

Oracle Exadata: In Response to Chuck Hollis

October 24th, 2008

Chuck Hollis, VP and Global Marketing CTO at EMC has written a couple blog posts offering his thoughts on Oracle Exadata. The first was “Oracle Does Hardware” which he wrote the day after the product launch. The second, unimpressively titled “I Annoy Kevin Closson at Oracle” was on Monday October 20th which was in response to a blog post by Exadata Performance Architect, Kevin Closson who commented on Chuck’s first post and some comments left on Kevin’s blog.

Clearly Stated Intentions

Since Chuck had disabled comments for his “I Annoy Kevin” post, I’m going to write my comments here. I have no intention to get into some fact-less debate turn flame, but I will make some direct comments with supporting facts and numbers while keeping it professional.

Storage Arrays: Bottleneck or Not?

Chuck thinks:

“…array-based storage technology is not the bottleneck; our work with Oracle [on the Oracle Optimized Warehouse Initiative] and other DW/BI environments routinely shows that we can feed data to a server just as fast as it can take it.”

First let me comment on the Optimized Warehouse Initiative. There have been some good things that have come out of this effort. I believe it has increased the level of awareness when it comes to sizing storage for BI/DW workloads. All too often storage sizing for BI/DW is done by capacity, not I/O bandwidth. The focus is on building balanced systems: systems that can execute queries and workloads such that no one component (CPU/storage connectivity/disk array/disk drives) becomes the bottleneck prematurely. The industry seems to agree: IBM has the Balanced Warehouse and Microsoft has a reference architecture for Project Madison as well.

So the question comes back to: Is array-based storage technology the bottleneck or not? I would argue it is. Perhaps I would use a word other than “bottleneck”, but let’s be clear on the overall challenge here. That is: to read data off disk with speed and efficiently return it to the database host to process it as fast as possible.

Let’s start at the bottom of the stack: hard disk drives. If the challenge is to scan lots of data fast, then how fast data can be read off disk is the first important metric to consider. In the white paper Deploying EMC CLARiiON CX4-960 for Data Warehouse/Decision Support System (DSS) Workloads EMC reports a drive scan rate (for a BI/DW workload) of 20 MB/s using 8+1 RAID-5 and 33 MB/s using a 2+1 RAID-5 LUN configuration. Oracle Exadata delivers drive scan rates around 85 MB/s, a difference of 2.5X to 4.25X. To understand the performance impact of this I’ve put together a few tables of data based on these real workload numbers.

Hardware Specs and Numbers for Data Warehouse Workloads

Storage RAID Raw:Usable Ratio Disk Drives Disk Scan Rate
EMC CX4-960
8+1 RAID 5
9:8
146 GB FC 15k RPM
20 MB/s
EMC CX4-960
2+1 RAID 5
3:2
146 GB FC 15k RPM
33 MB/s
EMC CX4-960
8+1 RAID 5
9:8
300 GB FC 15k RPM
20 MB/s
EMC CX4-960
2+1 RAID 5
3:2
300 GB FC 15k RPM
33 MB/s
Oracle Exadata
ASM Mirroring
2:1
450 GB SAS 15k RPM
85 MB/s

Sizing By Capacity

Storage RAID Total Usable Space Disk Drive Number of Drives Total Scan Rate
EMC CX4-960
8+1 RAID 5
18 TB
146 GB
139
2.8 GB/s
EMC CX4-960
2+1 RAID 5
18 TB
146 GB
185
6.1 GB/s*
EMC CX4-960
8+1 RAID 5
18 TB
300 GB
68
1.4 GB/s
EMC CX4-960
2+1 RAID 5
18 TB
300 GB
90
3.0 GB/s
Oracle Exadata
ASM Mirroring
18 TB
450 GB
80
6.8 GB/s

* I’m not sure that the CX4-960 array head is capable of 6.1 GB/s so it likley takes at least 2 CX4-960 array heads to deliver this throughput to the host(s).

Sizing By Scan Rate

Storage RAID Total Scan Rate Disk Drive Number of Drives Total Usable Space
EMC CX4-960
8+1 RAID 5
3.00 GB/s
146 GB
150
19.46 TB
EMC CX4-960
2+1 RAID 5
3.00 GB/s
146 GB
90
8.76 TB
EMC CX4-960
8+1 RAID 5
3.00 GB/s
300 GB
150
40.00 TB
EMC CX4-960
2+1 RAID 5
3.00 GB/s
300 GB
90
18.00 TB
Oracle Exadata
ASM Mirroring
3.00 GB/s
450 GB
36
8.10 TB

A Few Comments On The Above Data Points

Please note that “Total Usable Space” is a rough number for the total protected disk space one can use for a database if you filled each drive up to capacity. It does not take into consideration things like loss for formatting, space for sort/temp, etc, etc. I would use a 60% rule for estimating data space for database vs. total usable space. This means that 18 TB of total usable space would equate to 10 TB (max) of space for database data (compression not accounted for).

I’d also like to note that in the Sizing By Capacity table the “Total Scan Rate” is a disk only calculation. Whether or not a single CX4-960 array head can move data at that rate is in question. Based on the numbers in the EMC whitepaper it would appear CX4-960 head is capable of 3 GB/s but I would question if it is capable of much more than that, hence the reason for the asterisk(*).

Looking At The Numbers

If you look at the number for Sizing By Capacity, you can see that for the given fixed size, Exadata provides the fastest scan rate while using only 80 disk drives. The next closest scan rate is just 700 MB/s less but it uses 105 more disk drives (80 vs. 185). Quite a big difference.

When it comes to delivering I/O bandwidth, Exadata clearly stands out. Targeting a scan rate of 3 GB/s, Exadata delivers this using only 36 drives, just 3 Exadata Storage Servers. If one wanted to deliver this scan rate with the CX4 it would take 2.5X as many drives (90 vs. 36) using 2+1 RAID 5.

So are storage arrays the bottleneck? You can draw your own conclusions, but I think the numbers speak to the performance advantage with Oracle Exadata when it comes to delivering I/O bandwidth and fast scan rates. Consider this: What would the storage topology look like if you wanted to deliver a scan rate of 74 GB/s as we did for Oracle OpenWorld with 84 HP Oracle Exadata Storage Servers (6 HP Oracle Database Machines)? Honestly I would struggle to think where I would put the 185 or so 4Gb HBAs to achieve that.

Space Saving RAID or Wasteful Mirroring

This leads me to another comment by Chuck in his second post:

“[with Exadata] The disk is mirrored, no support of any space-saving RAID options — strange, for such a large machine”

And this one in his first post:

“If it were me, I’d want a RAID 5 (or 6) option.”

And his comment on Kevin’s blog:

“The fixed ratio of 12 disks (6 usable) per server element strikes us as a bit wasteful….And, I know this only matters to storage people, but there’s the minor matter of having two copies of everything, rather than the more efficient parity RAID approaches. Gets your attention when you’re talking 10-40TB usable, it does.”

Currently Exadata uses ASM mirroring for fault tolerance so there is a 2:1 ratio of raw disk to usable disk, however I don’t think it matters much. The logic behind that comment is that when one is sizing for a given scan rate, Exadata uses less spindles than the other configurations even though the disk protection is mirroring and not space-saving RAID 5. I guess I think it is strange to worry about space savings when disks just keep getting bigger and many are keeping the same performance characteristics as their predecessors. Space is cheap. Spindles are expensive. When one builds a configuration that satisfies the I/O scan rate requirement, chances are you have well exceeded the storage capacity requirement, even when using mirroring.

Perhaps Chuck likes space-saving RAID 5, but I think using less drives (0.4 as many, 36 vs. 90) to deliver the same scan rate is hardly wasteful. You know what really gets my attention? Having 40 TB of total usable space on 15 HP Oracle Exadata Storage Servers (180 450GB SAS drives) and being able to scan it at 15 GB/s compared to say having a CX4 with 200 drives @ 300GB using 2+1 R5 and only being able to scan them at 6.6 GB/s. I’d also be willing to bet that would require at least 2 if not 3 CX4-960 array heads and at least 30 4Gb HBAs running at wire speed (400 MB/s).

Exadata Is Smart Storage

Chuck comments:

“Leaving hardware issues aside, how much of the software functionality shown here is available on generic servers, operating systems and storage that Oracle supports today? I was under the impression that most of this great stuff was native to Oracle products, and not a function of specific tin …

If the Exadata product has unique and/or specialized Oracle logic, well, that’s a different case.”

After reading that I would said Chuck has not read the Technical Overview of the HP Oracle Exadata Storage Server. Not only does Exadata have a very fast scan rate, it has intelligence. A combination of brawn and brains which is not available with other storage platforms. The Oracle Exadata Storage Server Software (say that 5 times fast!!!) is not an Oracle database. It is storage software not database software. The intelligence and specialized logic is that Exadata Smart Scans return only the relevant rows and columns of a query, allowing for better use of I/O bandwidth and increased database performance because the database host(s) are not issuing I/O requests for data that is not needed for the query and then processing it post-fact. There are a couple slides (18 & 19) referencing a simple example of the benifits of Smart Scans in the HP Oracle Exadata Storage Server technical overview slide deck. It is worth the read.

It Will Be Interesting Indeed

Chuck concludes his second post with:

“The real focus here should be software, not hardware.”

Personally I think the focus should be on solutions that perform and scale and I think the HP Oracle Exadata Storage Server is a great solution for Oracle data warehouses that require large amounts of I/O bandwidth.

Ending On A Good Note

While many comments by Chuck do not seem to be well researched I would comment that having a conventional mid-range storage array that can deliver 3 GB/s is not a bad thing at all. I’ve seen many Oracle customers that have only a fraction of that and there are probably some small data warehouses out there that may run fine with 3 GB/s of I/O bandwidth. However, I think that those would run even faster with Oracle Exadata and I’ve never had a customer complain about queries running too fast.

Post to Twitter Tweet This Post

Exadata, Oracle, Performance, VLDB

Oracle 11g: Incremental Global Statistics On Partitioned Tables

July 16th, 2008

Previously I blogged about the new and improved DBMS_STATS.AUTO_SAMPLE_SIZE used to calculate NDV in Oracle 11g and now I wanted to touch on another new feature of DBMS_STATS in 11g: Incremental Global Statistics On Partitioned Tables.

Before Incremental Global Stats (Two-Pass Method)

When DBMS_STATS.GATHER_TABLE_STATS collects statistics on a partitioned table, generally it does so at the partition and table (global) level (the default behavior can be modified by changing the GRANULARITY parameter). This is done in two steps. First, partition level stats are gathered by scanning the partition(s) that have stale or empty stats, then a full table scan is executed to gather the global statistics. As more partitions are added to a given table, the longer the execution time for GATHER_TABLE_STATS, due to the full table scan requited for global stats.

Using Incremental Global Stats (Synopsis-Based Method)

Incremental Global Stats works by collecting stats on partitions and storing a synopsis which is the statistics metadata for that partition and the columns for that partition. This synopsis is stored in the SYSAUX tablespace, but is quite small (only a few kilobytes). Global stats are then created not by reading the entire table, but by aggregating the synopses from each partition. Incremental Global Stats, in conjunction with the new 11g DBMS_STATS.AUTO_SAMPLE_SIZE, yield a significant reduction in the time to collect statistics and produce near perfect accuracy.

Turning On Incremental Global Stats

Incremental Global Stats can only be used for partitioned tables and is activated by this command:

SQL> exec DBMS_STATS.SET_TABLE_PREFS(user,'FOO','INCREMENTAL','TRUE')

-- To see the value for INCREMENTAL for a given table:

SQL> select dbms_stats.get_prefs('INCREMENTAL', tabname=>'FOO') from dual;

DBMS_STATS.GET_PREFS('INCREMENTAL',TABNAME=>'FOO')
--------------------------------------------------
TRUE

You may also use any of the other DBMS_STATS.SET_*_PREFS as well.

A Real-World Example

To demonstrate the benefit of Incremental Global Statistics, I created a range partitioned table consisting of 60 range partitions. The target table starts empty and one million (1,000,000) rows are inserted into a single partition of the table and then statistics are gathered. This is done 60 times, simulating loading 60 one day partitions (one at a time) emulating a daily ETL/ELT process over 60 days.


Incremental_Stats.png

Elapsed Times
Partitions
Incremental=FALSE
Incremental=TRUE
1
00:00:20.36
00:00:21.14
10
00:02:27.25
00:00:37.76
20
00:04:46.23
00:00:49.83
30
00:07:05.47
00:01:01.80
40
00:09:11.09
00:01:23.33
50
00:11:33.18
00:01:30.40
60
00:13:18.15
00:01:40.28
Cumulative Elapsed Time
06:42:21.20
01:00:53.80

As you can see from the chart and the table, without Incremental Global Stats the time to gather stats increases pretty much linearly with the number of partitions, but with Incremental Global Stats the elapse time only slightly increases. The big difference is in the cumulative elapsed time: It takes 6 hours 42 minutes without Incremental Global Stats, but only 1 hour with. Quite a significant savings over time!

Revisiting The Math

For this experiment the time to gather stats without Incremental Global Stats is:
(time to scan & gather for 1 partition) + (time to scan and gather for entire table)
When Incremental Global Stats is used the time to gather stats is:
(time to scan & gather for 1 partition) + (time to aggregate all synopses)

The Diff Test

I exported the stats into a stats table and then ran the diff to compare the two runs. This will show us how comparable the two methods of stats gathering are.

SQL> set long 500000 longchunksize 500000
SQL> select report, maxdiffpct from
     table(dbms_stats.diff_table_stats_in_stattab(user,'CATALOG_SALES','STATS_DEFAULT','STATS_INC'));

REPORT
------------------------------------------------------------------------------------
MAXDIFFPCT
----------
###############################################################################

STATISTICS DIFFERENCE REPORT FOR:
.................................

TABLE	      : CATALOG_SALES
OWNER	      : TPCDS
SOURCE A      : User statistics table STATS_DEFAULT
	      : Statid	   :
	      : Owner	   : TPCDS
SOURCE B      : User statistics table STATS_INC
	      : Statid	   :
	      : Owner	   : TPCDS
PCTTHRESHOLD  : 10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NO DIFFERENCE IN TABLE / (SUB)PARTITION STATISTICS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

COLUMN STATISTICS DIFFERENCE:
.............................

COLUMN_NAME	SRC NDV     DENSITY    HIST NULLS   LEN  MIN   MAX   SAMPSIZ
...............................................................................

CS_BILL_ADDR_SK A   1001152 .000000998 NO   148640  5	 C102  C402  5.9E+07
		B   1001176 .000000998 NO   148613  5	 C102  C402  5.9E+07
CS_BILL_CDEMO_S A   1868160 .000000535 NO   148646  6	 C102  C4025 5.9E+07
		B   1878320 .000000532 NO   148753  6	 C102  C4025 5.9E+07
CS_BILL_CUSTOME A   1942528 .000000514 NO   148104  6	 C102  C403  5.9E+07
		B   1949464 .000000512 NO   148192  6	 C102  C403  5.9E+07
CS_BILL_HDEMO_S A   7200    .000138888 NO   148227  4	 C102  C249  5.9E+07
		B   7200    .000138888 NO   148250  4	 C102  C249  5.9E+07
CS_CALL_CENTER_ A   30	    .033333333 NO   148310  3	 C102  C11F  5.9E+07
		B   30	    .033333333 NO   148272  3	 C102  C11F  5.9E+07
CS_CATALOG_PAGE A   11092   .000090155 NO   148111  5	 C102  C3023 5.9E+07
		B   11092   .000090155 NO   148154  5	 C102  C3023 5.9E+07
CS_EXT_LIST_PRI A   1133824 .000000881 NO   148461  6	 C102  C3036 5.9E+07
		B   1131680 .000000883 NO   148368  6	 C102  C3036 5.9E+07
CS_EXT_WHOLESAL A   394880  .000002532 NO   148842  5	 C102  C302  5.9E+07
		B   394880  .000002532 NO   148772  5	 C102  C302  5.9E+07
CS_ITEM_SK	A   205888  .000004857 NO   0	    5	 C102  C3152 5.9E+07
		B   205408  .000004868 NO   0	    5	 C102  C3152 5.9E+07
CS_LIST_PRICE	A   29896   .000033449 NO   148438  5	 C102  C204  5.9E+07
		B   29896   .000033449 NO   148458  5	 C102  C204  5.9E+07
CS_ORDER_NUMBER A   7151104 .000000139 NO   0	    6	 C102  C4102 5.9E+07
		B   7122072 .000000140 NO   0	    6	 C102  C4102 5.9E+07
CS_PROMO_SK	A   1000    .001       NO   148617  4	 C102  C20B  5.9E+07
		B   1000    .001       NO   148693  4	 C102  C20B  5.9E+07
CS_QUANTITY	A   100     .01        NO   148737  3	 C102  C202  5.9E+07
		B   100     .01        NO   148751  3	 C102  C202  5.9E+07
CS_SHIP_ADDR_SK A   1001088 .000000998 NO   148150  5	 C102  C402  5.9E+07
		B   1001152 .000000998 NO   148235  5	 C102  C402  5.9E+07
CS_SHIP_CDEMO_S A   1870592 .000000534 NO   148918  6	 C102  C4025 5.9E+07
		B   1878272 .000000532 NO   148862  6	 C102  C4025 5.9E+07
CS_SHIP_CUSTOME A   1938816 .000000515 NO   148300  6	 C102  C403  5.9E+07
		B   1948928 .000000513 NO   148309  6	 C102  C403  5.9E+07
CS_SHIP_DATE_SK A   1884    .000530785 NO   148674  6	 C4032 C4032 5.9E+07
		B   1884    .000530785 NO   148608  6	 C4032 C4032 5.9E+07
CS_SHIP_HDEMO_S A   7200    .000138888 NO   148172  4	 C102  C249  5.9E+07
		B   7200    .000138888 NO   148161  4	 C102  C249  5.9E+07
CS_SHIP_MODE_SK A   20	    .05        NO   148437  3	 C102  C115  5.9E+07
		B   20	    .05        NO   148486  3	 C102  C115  5.9E+07
CS_SOLD_DATE_SK A   1595    .000626959 NO   0	    6	 C4032 C4032 5.9E+07
		B   1587    .000630119 NO   0	    6	 C4032 C4032 5.9E+07
CS_WAREHOUSE_SK A   15	    .066666666 NO   148651  3	 C102  C110  5.9E+07
		B   15	    .066666666 NO   148620  3	 C102  C110  5.9E+07
CS_WHOLESALE_CO A   9901    .000100999 NO   149054  4	 C102  C202  5.9E+07
		B   9901    .000100999 NO   149099  4	 C102  C202  5.9E+07

The stats diff shows that for many columns the NDV is identical and the others are statistically equivalent (close enough to be the same). I will certainly be adding this feature to my “conviction must use list” for Oracle 11g.

Further Reading

If you are interested in the bits and bytes of how the synopsis-based method works, I would suggest you read the whitepaper, Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g that was presented on this topic at SIGMOD 2008.

Post to Twitter Tweet This Post

10gR2, 11gR1, Data Warehousing, Execution Plans, Optimizer, Oracle, Statistics, Troubleshooting, VLDB

Using Bitmap Indexes Effectively

May 29th, 2008

Recently I was reading this thread, “Trying to make use of bitmap indexes” on the Oracle Forum. Before I had finished a working example, Jonathan Lewis had posted his response which was on par with my thoughts. Since this is a topic I see frequently, I thought I would finish my experiment and publish it here.

What We Are Given

The author of the original post had stated that the table in question contains about 16 million rows and states: “The table contains three IDEAL columns for bitmap indexes the first of which
may have only two, the second three and the third four distinct values. I was planning to change the index type on these columns to BITMAP [from B-tree]
.” To keep the focus of this post narrow, I’m only going to discuss whether or not one should consider bitmap indexes for queries, and not discuss potential update related issues.

The Data

For this experiment, I’m going to create a table that has three columns with the given NDV from above and add in a few extra filler columns to pad it out a bit. Since I do not know the exact table structure, I’ll just go with a simple example. In reality, the posters table may be wider, but for this example, it is what it is.

create table bm_test
nologging compress
as
select
  round(dbms_random.value(1, 2)) a  -- NDV 2
, round(dbms_random.value(1, 3)) b  -- NDV 3
, round(dbms_random.value(1, 4)) c  -- NDV 4
, to_char(800000+100000*dbms_random.normal,'fm000000000000') c3
, to_char(800000+100000*dbms_random.normal,'fm000000000000') c4
, to_char(15000+2000*dbms_random.normal,'fm000000') c5
, to_char(80000+10000*dbms_random.normal,'fm000000000000') c6
from dual
connect by level <= 16000000
/

desc bm_test
 Name		   Null?    Type
 ----------------- -------- ------------
 A			    NUMBER
 B			    NUMBER
 C			    NUMBER
 C3			    VARCHAR2(13)
 C4			    VARCHAR2(13)
 C5			    VARCHAR2(7)
 C6			    VARCHAR2(13)

exec dbms_stats.gather_table_stats(user,'BM_TEST');

create bitmap index bm1 on bm_test(a);
create bitmap index bm2 on bm_test(b);
create bitmap index bm3 on bm_test(c);

select a, b, c, count(*)
from bm_test
group by a,b,c
order by a,b,c;

         A          B          C   COUNT(*)
---------- ---------- ---------- ----------
         1          1          1     333292
         1          1          2     666130
         1          1          3     666092
         1          1          4     333585
         1          2          1     668594
         1          2          2    1332121
         1          2          3    1332610
         1          2          4     668608
         1          3          1     333935
         1          3          2     666055
         1          3          3     666619
         1          3          4     333106
         2          1          1     333352
         2          1          2     665038
         2          1          3     665000
         2          1          4     333995
         2          2          1     669120
         2          2          2    1332744
         2          2          3    1332766
         2          2          4     668411
         2          3          1     333891
         2          3          2     665924
         2          3          3     664799
         2          3          4     334213

24 rows selected.

select segment_name,
       segment_type,
       sum(blocks) blocks,
       sum(bytes)/1024/1024 mb
from user_segments
where segment_name like 'BM%'
group by segment_name, segment_type;

SEGMENT_NAME SEGMENT_TYPE     BLOCKS         MB
------------ ------------ ---------- ----------
BM_TEST      TABLE            102592      801.5
BM1          INDEX               768          6
BM2          INDEX              1152          9
BM3          INDEX              1408         11

select object_name, object_id
from user_objects
where object_name like 'BM%'

OBJECT_NAME   OBJECT_ID
------------ ----------
BM_TEST           54744
BM1               54745
BM2               54746
BM3               54747

The Queries And Execution Plans

The original post did not contain any queries or predicates, so for the purpose of this example I’m going to assume that there are exactly three predicates, one on each of column A, B and C, and that each predicate is a single equality (e.g. A=1 and B=1 and C=1). Looking at the data distribution from the query above, we observe there are approximately three different grouping counts: the lower around 333,000 the middle around 666,000 and the upper around 1,300,000. I will choose tuples from each of these groupings for the three test cases.

Query A

select *
from bm_test
where a=1 and b=1 and c=1;

333292 rows selected.
Plan hash value: 3643416817

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |       |       | 23314 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID | BM_TEST |   326K|    17M| 23314   (1)| 00:04:40 |
|   2 |   BITMAP CONVERSION TO ROWIDS|         |       |       |            |          |
|   3 |    BITMAP AND                |         |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE| BM3     |       |       |            |          |
|*  5 |     BITMAP INDEX SINGLE VALUE| BM2     |       |       |            |          |
|*  6 |     BITMAP INDEX SINGLE VALUE| BM1     |       |       |            |          |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("C"=1)
   5 - access("B"=1)
   6 - access("A"=1)

Query B

select *
from bm_test
where a=1 and b=1 and c=2;

666130 rows selected.
Plan hash value: 3202922749

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |       |       | 27105 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID | BM_TEST |   653K|    34M| 27105   (1)| 00:05:26 |
|   2 |   BITMAP CONVERSION TO ROWIDS|         |       |       |            |          |
|   3 |    BITMAP AND                |         |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE| BM2     |       |       |            |          |
|*  5 |     BITMAP INDEX SINGLE VALUE| BM1     |       |       |            |          |
|*  6 |     BITMAP INDEX SINGLE VALUE| BM3     |       |       |            |          |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("B"=1)
   5 - access("A"=1)
   6 - access("C"=2)

Query C

select *
from bm_test
where a=1 and b=2 and c=2;

1332121 rows selected.
Plan hash value: 1873942893

-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |       |       | 28243 (100)|          |
|*  1 |  TABLE ACCESS FULL| BM_TEST |  1377K|    72M| 28243   (2)| 00:05:39 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(("C"=2 AND "B"=2 AND "A"=1))

As you can see from the execution plans, Query A and B use the bitmap indexes and Query C uses a Full Table Scan. Of the 16,000,000 rows, Query A returns 333,292 (2.08%), Query B returns 666,130 (4.16%) and Query C returns 1,332,121 rows (8.33%). I think it is important to note that the change in the execution plan from index access to table scan is due to the costing, not directly due to the percentage of data returned.

Execution Times

I’m going to gather two sets of execution times. The first will be with a cold buffer cache, and the second with a warm buffer cache. All elapsed times are in seconds.

Query Execution Plan Cold Cache Warm Cache
A Bitmap Index 38 3
B Bitmap Index 40 4
C FTS 16 16

As you can see from the execution times, there is a significant difference (approx. 11x) between the cold and warm cache executions of each Query A and Query B. The other observation is that Query C (FTS) is faster than Query A (Index Access) on a cold cache. We surely need to account for this. One observation I made (from iostat) is that the I/O throughput rate for Query A and Query B was around 23MB/s while the I/O rate for Query C was around the 55MB/s range during the cold cache execution. None of the queries used the Parallel Query Option.

Lets take a look at the tkprof output from both the cold and warm cache executions of Query A and see if we can find where the time is being spent. The traces were collected using event 10046, level 8.

Query A TKPROF – Warm Cache

select /* warm cache */ *
from bm_test
where a=1 and b=1 and c=1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     3334      2.20       2.17          0     102184          0      333292
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     3336      2.20       2.18          0     102184          0      333292

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 31

Rows     Row Source Operation
-------  ---------------------------------------------------
 333292  TABLE ACCESS BY INDEX ROWID BM_TEST (cr=102184 pr=0 pw=0 time=19332 us cost=23314 size=17945290 card=326278)
 333292   BITMAP CONVERSION TO ROWIDS (cr=1162 pr=0 pw=0 time=2329 us)
     92    BITMAP AND  (cr=1162 pr=0 pw=0 time=1691 us)
    642     BITMAP INDEX SINGLE VALUE BM3 (cr=367 pr=0 pw=0 time=104 us)(object id 54747)
    697     BITMAP INDEX SINGLE VALUE BM2 (cr=396 pr=0 pw=0 time=92 us)(object id 54746)
    727     BITMAP INDEX SINGLE VALUE BM1 (cr=399 pr=0 pw=0 time=117 us)(object id 54745)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    3337        0.00          0.00
  SQL*Net message from client                  3337        0.00          1.04

When the cache is warm, there are no physical reads that take place. This would explain the fast execution of the query.

Note: For Bitmap execution plans, the number that appears in the rows column is actually bitmap fragments (compressed rowids), not actual rows. This is why the number looks suspiciously small.

Query A TKPROF – Cold Cache

select /* cold cache */ *
from bm_test
where a=1 and b=1 and c=1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     3334     11.44      36.22      99722     102184          0      333292
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     3336     11.45      36.22      99722     102184          0      333292

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 31

Rows     Row Source Operation
-------  ---------------------------------------------------
 333292  TABLE ACCESS BY INDEX ROWID BM_TEST (cr=102184 pr=99722 pw=99722 time=294694 us cost=23314 size=17945290 card=326278)
 333292   BITMAP CONVERSION TO ROWIDS (cr=1162 pr=1041 pw=1041 time=2490 us)
     92    BITMAP AND  (cr=1162 pr=1041 pw=1041 time=5104 us)
    642     BITMAP INDEX SINGLE VALUE BM3 (cr=367 pr=324 pw=324 time=1840 us)(object id 54747)
    697     BITMAP INDEX SINGLE VALUE BM2 (cr=396 pr=351 pw=351 time=1817 us)(object id 54746)
    727     BITMAP INDEX SINGLE VALUE BM1 (cr=399 pr=366 pw=366 time=1534 us)(object id 54745)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    3336        0.00          0.00
  SQL*Net message from client                  3336        0.00          1.12
  db file sequential read                     99722        0.04         30.60

As you can see the majority of the time was spent on db file sequential read doing the 99,722 physical reads. This explains the difference in elapsed time between the cold and warm cache executions of Query A: it comes down to physical I/O. But why does Query C run in half the time that Query A runs in when the cache is cold, given that Query C is doing a FTS and Query A is not? Shouldn’t the FTS plan be slower than the index plan?

Looking at the raw trace file for Query A, we observe the following:

WAIT #2: nam='db file sequential read' ela= 241 file#=1 block#=1770152 blocks=1 obj#=54744 tim=1212013191665924
WAIT #2: nam='db file sequential read' ela= 232 file#=1 block#=1770153 blocks=1 obj#=54744 tim=1212013191666240
WAIT #2: nam='db file sequential read' ela= 351 file#=1 block#=1770156 blocks=1 obj#=54744 tim=1212013191666650
WAIT #2: nam='db file sequential read' ela= 240 file#=1 block#=1770157 blocks=1 obj#=54744 tim=1212013191666948
WAIT #2: nam='db file sequential read' ela= 298 file#=1 block#=1770158 blocks=1 obj#=54744 tim=1212013191667306

As you can see, the table is being read sequentially 1 block at a time. Let’s examine the TKPROF from Query C.

Query C TKPROF

select *
from bm_test
where a=1 and b=2 and c=2

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    13323      5.99      11.17     102592     115831          0     1332121
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    13325      5.99      11.17     102592     115831          0     1332121

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 31

Rows     Row Source Operation
-------  ---------------------------------------------------
1332121  TABLE ACCESS FULL BM_TEST
(cr=115831 pr=102592 pw=102592 time=102744 us cost=28243 size=75768825 card=1377615)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   13325        0.00          0.01
  SQL*Net message from client                 13325        0.00          4.23
  db file sequential read                         2        0.02          0.03
  direct path read                              952        0.08          5.20

The majority of the time is spent on direct path read.

Let’s dig deeper and look at the raw trace file from Query C.

WAIT #2: nam='direct path read' ela= 6029 file number=1 first dba=1785609 block cnt=128 obj#=54744 tim=1212013229612857
WAIT #2: nam='direct path read' ela= 8638 file number=1 first dba=1787017 block cnt=128 obj#=54744 tim=1212013229628256
WAIT #2: nam='direct path read' ela= 7019 file number=1 first dba=1789193 block cnt=128 obj#=54744 tim=1212013229642410
WAIT #2: nam='direct path read' ela= 9276 file number=1 first dba=1791497 block cnt=128 obj#=54744 tim=1212013229658400
WAIT #2: nam='direct path read' ela= 6173 file number=1 first dba=1792777 block cnt=128 obj#=54744 tim=1212013229671314

As you can see with Query C, the read size is 128 blocks or 1MB (128 blocks * 8k block), the largest I/O that Oracle will issue. This explains the difference in the observed I/O throughput (23MB/s vs. 55MB/s): the bitmap index plan reads the table 1 block at a time, and the FTS reads (most of) it 128 blocks at a time. It makes good sense that if the read throughput rate is ~2x (23MB/s vs. 55MB/s) then the execution time would be ~0.5 as long (38 seconds vs. 16 seconds). The larger I/O size will have a higher throughput rate compared to a smaller I/O size. The exact breakdown of the multi-block reads are:

BLOCK_COUNT      COUNT TOTAL_BLOCKS
----------- ---------- ------------
          7          2           14
          8        106          848
          9         34          306
         16         10          160
         33          8          264
        119         42         4998
        128        750        96000
            ---------- ------------
sum                952       102590

Making Sense Of All The Observations

If we look at the tkprof output again from Query A, we see there are 99,722 waits on db file sequential read. Of those 99,722 waits, 98,681 are on the table (grep is our friend here using the raw trace file and the event and object number), the remaining are for the indexes. This tells us that 98,681 out of 102,592 total blocks of the table were retrieved, just 1 block at a time. Basically we have done a very inefficient full table scan. This explains our two observations: 1) why the FTS is faster than the index access plan with a cold cache and 2) why the FTS has a higher read throughput than the index access plan. It all comes down to efficient physical I/O.

The Big Picture

Just because a column has a low NDV does not necessarily mean it is an ideal candidate for a bitmap index. Just like B-tree indexes, bitmap indexes are best leveraged when the combination of them makes it very selective (returns only a small number of rows). The classic example of using a bitmap index on a gender column (male/female) is a horrible one in my opinion. If there are only two values, and there is an even distribution of data, 50% selectivity is too large and thus not a good candidate for a bitmap index. Would you use any index to access 50% of a table?

Bitmap indexes can be very useful in making queries run fast, but if the BITMAP CONVERSION TO ROWIDS returns a large list of rowids, you may find that a FTS (or partition scan) may yield better performance, but may use more I/O resources. It comes down to a trade off: If there is a high buffer cache hit rate for the objects in the bitmap plans, it will run reasonably fast and requite less physical I/O. If the objects are unlikely to be in the buffer cache, a FTS will yield better performance as long as it is not bottlenecked on I/O bandwidth.

Post to Twitter Tweet This Post

10gR2, 11gR1, Data Warehousing, Execution Plans, Optimizer, Performance, SQL Tuning, Statistics, Troubleshooting, VLDB

Top Ways How Not To Scale Your Data Warehouse

April 28th, 2008

Working in the Real-World Performance Group at Oracle has allowed me to see quite a few customers’ data warehouses. Unfortunately some customers find their data warehouse suffering from performance problems, not because there is a platform issue, but often because the features are not used or are not used correctly. I thought I’d put together a list of the most common problems but present them in a facetious manner. The following is meant to be sarcastic and read with a bit of humor. Consider it the “Comedy of Errors” data warehouse edition.

Add An Index To Fix Each Slow Query

If you think the solution for slow queries in a data warehouse is to add indexes, you are probably mistaken. Not only is index access likely the most inefficient way to access lots of rows, the more indexes that are on a table, the longer it takes to load data into that table, even if you build them after the data is loaded; not to mention the extra space that is required for those indexes.

Do Not Use Parallel Operations

There is a reason that row-by-row processing is synonymous with slow-by-slow. It does not scale. If you want to make sure you will have no chance to scale your ETL/ELT, use PL/SQL cursor for loops and process the data serially.

Parallel query is one of the best ways to linearly reduce query response times so not using it would surely help you not scale your data warehouse. Why would one want to solve a few queries in a short amount of time by leveraging a large amount of hardware resources with many parallel processes when you can run many serial queries each using a single process? After all, you would not want to make your CPUs and disks too busy, as they might get tired.

Do Not Use Partitioning

Not effectively using partitioning is probably the best way not to scale your data warehouse. Without leveraging partition elimination in queries, is is pretty much a guarantee that the more data you add to your tables, the longer your queries will run. Give yourself enough time and data, and you will surely have the slowest data warehouse on the planet, if not the most inefficient.

Do Not Use Compression

Using compression would not only reduce the disk space required to store data, it also would reduce the amount of I/O bandwidth to bring data back to the host. Given that most database servers have infinite storage capacity as well as infinite I/O bandwidth, compression would not yield much, if any benefit. Why do less work when you can do more?

Besides, using compression makes the load take longer, right? Given that the data is loaded once and queried thousands (or more) times, it would make sense to optimize for loading and not queries, no?

Do Not Use Analytic SQL

Why would anyone want to use analytic SQL when they can write simpler SQL that performs slower without that functionality. After all, isn’t it better to access the data multiple times versus a single pass? The more times the data is accessed, the more likely it is to be in cache, correct?

Do Not Use Materialized Views Or Aggregates

Why use a “work smarter, not harder” mentality when you can just brute force it. After all, we are manly men and will not be shown up by brains and elegance. We’d rather add up every row from the point of sale table for year end report versus adding up the monthly aggregates. Disk space is cheap and performance is expensive, so perhaps if enough space is saved, it will add up, right?

Do Not Store Data In Its Most Usable Format

Storing data in a uniform case (all upper or all lower) would make the data too clean and usable. It’s much more efficient to change the data with an upper() or lower() function each and every time the data is queried versus changing it once when it is loaded. And while we are at it, let’s store dates as strings and convert them every time as well. Since there is idle CPU from running serial queries, might as well put it to good use changing the case of text and turning strings into dates, no? Why store that GL segment number that everyone queries on in its own column when the users can just use a substring() function to find it. After all, storing redundant data that would make queries perform better is not the primary objective, it’s saving those bytes per row that is critical.

Do Not Leverage Good Design

A data warehouse model is just an OLTP model with much more historical data and a few extra tables for reports, right? One of the best ways to tank your data warehouse’s performance is to design and manage it like it was a large OLTP database. Fundamentally OLTP and data warehousing are completely different and have very little in common, but they can’t be that different, can they?

Even though most BI tools work best when there is well designed model, it is possible to just put a layer of performance killing views in the database to make it look like a well designed dimensional model. This way the ETL can stay simple, the users do not have to write complicated SQL and the BI tools will work (slowly). Why take the time to design a good data model when you can emulate one at one tenth (or less) the scalability.

Consume Excessive Time And Resources Gathering Statistics

Another great way to consume time and compute resources is gathering statistics on data that has not changed or has not significantly changed. If you are regathering statistics on data that has not changed, you should probably consider that Albert Einstein once said “The definition of insanity is doing the same thing over and over again and expecting different results”.

Put The Data Warehouse On A Shared Disk Array

Given that a data warehouse generally requires significant I/O bandwidth and often times use significant disk and compute resources, it would probably be a good idea to share it with other systems. This way the data warehouse (as well as the other applications) will not only have unpredictable performance, but the warehouse will have a scape goat for one to blame when there are issues. Last time I checked, none of the data warehouse appliances or MPP data warehouse databases shared storage with anyone, so maybe they are missing something?

Post to Twitter Tweet This Post

Data Warehousing, Oracle, Performance, VLDB