Category: Exadata

Exadata Smart Flash Logging Explained

I’ve seen some posts on the blogosphere where people attempt to explain (or should I say guess) how Exadata Smart Flash Logging works and most of them are wrong. Hopefully this post will help clear up some the misconceptions out there.

The following is an excerpt from the paper entitled “Exadata Smart Flash Cache Features and the Oracle Exadata Database Machine” that goes into technical detail on the Exadata Smart Flash Logging feature.

Smart Flash Logging works as follows. When receiving a redo log write request, Exadata will do
parallel writes to the on-disk redo logs as well as a small amount of space reserved in the flash
hardware. When either of these writes has successfully completed the database will be
immediately notified of completion. If the disk drives hosting the logs experience slow response
times, then the Exadata Smart Flash Cache will provide a faster log write response time.
Conversely, if the Exadata Smart Flash Cache is temporarily experiencing slow response times
(e.g., due to wear leveling algorithms), then the disk drive will provide a faster response time.
Given the speed advantage the Exadata flash hardware has over disk drives, log writes should be
written to Exadata Smart Flash Cache, almost all of the time, resulting in very fast redo write
performance. This algorithm will significantly smooth out redo write response times and provide
overall better database performance.

The Exadata Smart Flash Cache is not used as a permanent store for redo data – it is just a
temporary store for the purpose of providing fast redo write response time. The Exadata Smart
Flash Cache is a cache for storing redo data until this data is safely written to disk. The Exadata
Storage Server comes with a substantial amount of flash storage. A small amount is allocated for
database logging and the remainder will be used for caching user data. The best practices and
configuration of redo log sizing, duplexing and mirroring do not change when using Exadata
Smart Flash Logging. Smart Flash Logging handles all crash and recovery scenarios without
requiring any additional or special administrator intervention beyond what would normally be
needed for recovery of the database from redo logs. From an end user perspective, the system
behaves in a completely transparent manner and the user need not be aware that flash is being
used as a temporary store for redo. The only behavioral difference will be consistently low
latencies for redo log writes.

By default, 512 MB of the Exadata flash is allocated to Smart Flash Logging. Relative to the 384
GB of flash in each Exadata cell this is an insignificant investment for a huge performance
benefit. This default allocation will be sufficient for most situations. Statistics are maintained to
indicate the number and frequency of redo writes serviced by flash and those that could not be
serviced, due to, for example, insufficient flash space being allocated for Smart Flash Logging.
For a database with a high redo generation rate, or when many databases are consolidated on to
one Exadata Database Machine, the size of the flash allocated to Smart Flash Logging may need
to be enlarged. In addition, for consolidated deployments, the Exadata I/O Resource Manager
(IORM) has been enhanced to enable or disable Smart Flash Logging for the different databases
running on the Database Machine, reserving flash for the most performance critical databases.

Real-World Performance Videos on YouTube – Data Warehousing

Here are some videos of a data warehouse demo that the Real-World Performance Group has been running for a while now and we thought it was time to put them on YouTube. Hope you find them informative.

Migrate a 1TB Data warehouse in 20 Minutes (Part 1)

Migrate a 1TB Data warehouse in 20 Minutes (Part 2)

Migrate a 1TB Data warehouse in 20 Minutes (Part 3)

Migrate a 1TB Data warehouse in 20 Minutes (Part 4)

Debunking More Netezza FUD About Exadata

A reader recently left a comment for which my reply was longer than I’d like to leave for a comment so I’m answering it in detail with this blog post.

Gabramel writes:

Greg,
Nice article. I am just reading the Netezza paper.

You don’t appear to have debunked the following statement.

“Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.”

Not many queries exist where data is only required from two tables. Are Oracle suggesting we need to change the way data is structured to enable best use of Exadata – increasing TCO significantly?

Thanks & Nice post.

There is a reason that I did not debunk that statement – it did not exist in the original version of Netezza’s paper. It seems they have taken the shopping basket example that I debunked in my previous post and replaced it with this one. Nonetheless lets take a look at Netezza’s claim:

Exadata’s storage tier provides Bloom filters to implement simple joins between one large and one smaller table, anything more complex cannot be processed in MPP. Analytical queries commonly require joins more complex than those supported by Exadata. Consider the straightforward case of an international retailer needing insight to the dollar value of sales made in stores located in the UK. This simple SQL query requires a join across three tables – sales, currency and stores.

select sum(sales_value * exchange_rate) us_dollar_sales
from sales, currency, stores
where sales.day = currency.day
and stores.country = 'UK'
and currency.country = 'USA'

Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.

Before I comment, did you spot the error with the SQL query? Hint: Count the number of tables and joins.

Now that we can clearly see that Netezza marketing can not write good SQL because this query contains a cross product as there is no JOIN between sales and stores thus the value returned from this query is not “the [US] dollar value of sales made in stores located in the UK”, it’s some other rubbish number.

Netezza is trying to lead you to believe that sending data to the database nodes (running Oracle RAC) is a bad thing, which is most certainly is not. Let’s remember what Exadata is – Smart Storage. Exadata itself is not an MPP database, so of course it needs to send some data back to the Oracle database nodes where the Oracle database kernel can use Parallel Execution to easily parallelize the execution of this query in an MPP fashion efficiently leveraging all the CPUs and memory of the database cluster.

The reality here is that both Netezza and Oracle will do the JOIN in their respective databases, however, Oracle can push a Bloom filter into Exadata for the STORES.COUNTRY predicate so that the only data that is returned to the Oracle database are rows matching that criteria.

Let’s assume for a moment that the query is correctly written with two joins and the table definitions look like such (at least the columns we’re interested in):

create table sales (
 store_id    number,
 day         date,
 sales_value number
);

create table currency (
 day           date,
 country       varchar2(3),
 exchange_rate number
);

create table stores (
 store_id number,
 country  varchar2(3)
);

select 
    sum(sales.sales_value * currency.exchange_rate) us_dollar_sales
from 
    sales, 
    currency, 
    stores
where 
    sales.day = currency.day
and sales.store_id = stores.store_id
and stores.country = 'UK'
and currency.country = 'USA'

For discussion’s sake, let’s assume the following:

  • There is 1 year (365 days) in the SALES table of billions of rows
  • There are 5000 stores in the UK (seems like a realistic number to me)

There is no magic in those numbers, it’s just something to add context to the discussion, so don’t think I picked them for some special reason. Could be more, could be less, but it really doesn’t matter.

So if we think about the the cardinality for the tables:

  • STORES has a cardinality of 5000 rows
  • CURRENCY has a cardinality of 365 rows (1 year)

The table JOIN order should be STORES -> SALES -> CURRENCY.

With Exadata what will happen is such:

  • Get STORE_IDs from STORE where COUNTRY = ‘UK’
  • Build a Bloom Filter of these 5000 STORE_IDs and push them into Exadata
  • Scan SALES and apply the Bloom Filter in storage, retuning only rows for UK STORE_IDs and project only the necessary columns
  • JOIN that result to CURRENCY
  • Compute the SUM aggregate

All of these operations are performed in parallel using Oracle’s Parallel Execution.

Netezza suggests that Exadata can use Bloom filters for only two table joins (1 big, 1 small) and that analytical queries are more complex than that so Exadata can not use a Bloom filter and provide an example to suggest such. The reality is not only is their example incorrectly written SQL, it also works great with Exadata Bloom filters and it is more than 2 tables! In addition, it is a great demonstration of efficient and smart data movement as Exadata can smartly filter using Bloom filters and needs to only project a very few columns, thus likely creating a big savings versus sending all the columns/rows from the storage. Thus Exadata Bloom filters can work with complex analytical queries of more than two tables and efficiently send data across the network to the Oracle RAC cluster where Parallel Execution will work on the JOINs and aggregation in an MPP manor.

Now to specifically answer your question: No, Oracle is not suggesting you need to change your data/queries to support two table joins, Exadata will likely work fine with what you have today. And to let you and everyone else in on a little secret: Exadata actually supports applying multiple Bloom filters to a table scan (we call this a Bloom filter list denoted by the Predicate Information of a query plan by SYS_OP_BLOOM_FILTER_LIST), so you can have multiple JOIN filters being applied in the Exadata storage, so in reality Bloom filters are not even limited to just 2 table JOINs.

Oh well, so much for Netezza competitive marketing. Just goes to show that Netezza has a very poor understanding how Exadata really works (yet again).

Making the Most of Oracle Exadata – A Technical Review

Over the past few weeks several people have asked me about an Exadata article entitled “Making the Most of Oracle Exadata” by Marc Fielding of Pythian. Overall it’s an informative article and touches on many of the key points of Exadata, however, even though I read (skimmed is a much better word) and briefly commented on the article back in August, after further review I found some technical inaccuracies with this article so I wanted to take the time to clarify this information for the Exadata community.

Exadata Smart Scans

Marc writes:

Smart scans: Smart scans are Exadata’s headline feature. They provide three main benefits: reduced data transfer volumes from storage servers to databases, CPU savings on database servers as workload is transferred to storage servers, and improved buffer cache efficiency thanks to column projection. Smart scans use helper processes that function much like parallel query processes but run directly on the storage servers. Operations off-loadable through smart scans include the following:

  • Predicate filtering – processing WHERE clause comparisons to literals, including logical operators and most SQL functions.
  • Column projection – by looking at a query’s SELECT clause, storage servers return only the columns requested, which is a big win for wide tables.
  • Joins – storage servers can improve join performance by using Bloom filters to recognize rows matching join criteria during the table scan phase, avoiding most of the I/O and temporary space overhead involved in the join processing.
  • Data mining model scoring – for users of Oracle Data Mining, scoring functions like PREDICT() can be evaluated on storage servers.

I personally would not choose a specific number of benefits from Exadata Smart Scan, simply stated, the design goal behind Smart Scan is to reduce the amount of data that is sent from the storage nodes (or storage arrays) to the database nodes (why move data that is not needed?). Smart Scan does this in two ways: it applies the appropriate column projection and row restriction rules to the data as it streams off of disk. However, projection is not limited to just columns in the SELECT clause, as Marc mentions, it also includes columns in the WHERE clause as well. Obviously JOIN columns need to be projected to perform the JOIN in the database nodes. The one area that Smart Scan does not help with at all is improved buffer cache efficiency. The reason for this is quite simple: Smart Scan returns data in blocks that were created on-the-fly just for that given query — it contains only the needed columns (projections) and has rows filtered out from the predicates (restrictions). Those blocks could not be reused unless someone ran the exact same query (think of those blocks as custom built just for that query). The other thing is that Smart Scans use direct path reads (cell smart table scan) and these reads are done into the PGA space, not the shared SGA space (buffer cache).

As most know, Exadata can easily push down simple predicates filters (WHERE c1 = ‘FOO’) that can be applied as restrictions with Smart Scan. In addition, Bloom Filters can be applied as restrictions for simple JOINs, like those commonly found in Star Schemas (Dimensional Data Models). These operations can be observed in the query execution plan by the JOIN FILTER CREATE and JOIN FILTER USE row sources. What is very cool is that Bloom Filters can also pass their list of values to Storage Indexes to aid in further I/O reductions if there is natural clustering on those columns or it eliminates significant amounts of data (as in a highly selective set of values). Even if there isn’t significant data elimination via Storage Indexes, a Smart Scan Bloom Filter can be applied post scan to prevent unneeded data being sent to the database servers.

Exadata Storage Indexes

Marc writes:

Storage indexes: Storage indexes reduce disk I/O volumes by tracking high and low values in memory for each 1-megabyte storage region. They can be used to give partition pruning benefits without requiring the partition key in the WHERE clause, as long as one of these columns is correlated with the partition key. For example, if a table has order_date and processed_date columns, is partitioned on order_date, and if orders are processed within 5 days of receipt, the storage server can track which processed_date values are included in each order partition, giving partition pruning for queries referring to either order_date or processed_date. Other data sets that are physically ordered on disk, such as incrementing keys, can also benefit.

In Marc’s example he states there is correlation between the two columns PROCESSED_DATE and ORDER_DATE where PROCESSED_DATE = ORDER_DATE + [0..5 days]. That’s fine and all, but to claim partition pruning takes place when specifying ORDER_DATE (the partition key column) or PROCESSED_DATE (non partition key column) in the WHERE clause because the Storage Index can be used for PROCESSED_DATE is inaccurate. The reality is, partition pruning can only take place when the partition key, ORDER_DATE, is specified, regardless if a Storage Index is used for PROCESSED_DATE.

Partition Pruning and Storage Indexes are completely independent of each other and Storage Indexes know absolutely nothing about partitions, even if the partition key column and another column have some type of correlation, as in Marc’s example. The Storage Index simply will track which Storage Regions do or do not have rows that match the predicate filters and eliminate reading the unneeded Storage Regions.

Exadata Hybrid Columnar Compression

Marc writes:

Columnar compression: Hybrid columnar compression (HCC) introduces a new physical storage concept, the compression unit. By grouping many rows together in a compression unit, and by storing only unique values within each column, HCC provides storage savings in the range of 80 90% based on the compression level selected. Since data from full table scans remains compressed through I/O and buffer cache layers, disk savings translate to reduced I/O and buffer cache work as well. HCC does, however, introduce CPU and data modification overhead that will be discussed in the next section.

The Compression Unit (CU) for Exadata Hybrid Columnar Compression (EHCC) is actually a logical construct, not a physical storage concept. The compression gains from EHCC come from column-major organization of the rows contained in the CU and the encoding and transformations (compression) that can be done because of that organization (like values are more common within the same column across rows, vs different columns in the same row). To say EHCC only stores unique values within each column is inaccurate, however, the encoding and transformation algorithms use various techniques that yield very good compression by attempting to represent the column values with as few bytes as possible.

Data from EHCC full table scans only remains fully compressed if the table scan is not a Smart Scan, in which case the compressed CUs are passed directly up to the buffer cache and the decompression will then be done by the database servers. However, if the EHCC full table scan is a Smart Scan, then only the columns and rows being returned to the database nodes are decompressed by the Exadata servers, however, predicate evaluations can be performed directly on the EHCC compressed data.

Read more: Exadata Hybrid Columnar Compression Technical White Paper

Marc also writes:

Use columnar compression judiciously: Hybrid columnar compression (HCC) in Exadata has the dual advantages of reducing storage usage and reducing I/O for large reads by storing data more densely. However, HCC works only when data is inserted using bulk operations. If non-compatible operations like single-row inserts or updates are attempted, Exadata reverts transparently to the less restrictive OLTP compression method, losing the compression benefits of HCC. When performing data modifications such as updates or deletes, the entire compression unit must be uncompressed and written in OLTP-compressed form, involving an additional disk I/O penalty as well.

EHCC does require bulk direct path load operations to work. This is because the compression algorithms that are used for EHCC need sets of rows as input, not single rows. What is incorrect with Marc’s comments is that when a row in a CU is modified (UPDATE or DELETE), the entire CU is not uncompressed and changed to non-EHCC compression, only the rows that are UPDATED are migrated to non-EHCC compression. For DELETEs no row migrations take place at all. This is easily demonstrated by tracking ROWIDs as in the example at the bottom of this post.

Exadata Smart Flash Cache

Marc writes:

Flash cache: Exadata s flash cache supplements the database servers buffer caches by providing a large cache of 384 GB per storage server and up to 5 TB in a full Oracle Exadata Database Machine, considerably larger than the capacity of memory caches. Unlike generic caches in traditional SAN storage, the flash cache understands database-level operations, preventing large non-repeated operations such as backups and large table scans from polluting the cache. Since flash storage is nonvolatile, it can cache synchronous writes, providing performance benefits to commit-intensive applications.

While flash (SSD) storage is indeed non-volatile, the Exadata Smart Flash Cache is volatile – it loses all of its contents if the Exadata server is power cycled. Also, since the Exadata Smart Flash is currently a write-through cache, it offers no direct performance advantages to commit-intensive applications, however, it does offer indirect performance advantages by servicing read requests that would otherwise be serviced by the HDDs, thus allowing the HDDs to service more write operations.

Read more: Exadata Smart Flash Cache Technical White Paper

EHCC UPDATE and DELETE Experiment

--
-- EHCC UPDATE example - only modified rows migrate
--

SQL> create table order_items1
  2  compress for query high
  3  as
  4  select rownum as rnum, x.*
  5  from order_items x
  6  where rownum <= 10000;

Table created.

SQL> create table order_items2
  2  as
  3  select rowid as rid, x.*
  4  from order_items1 x;

Table created.

SQL> update order_items1
  2  set quantity=10000
  3  where rnum in (1,100,1000,10000);

4 rows updated.

SQL> commit;

Commit complete.

SQL> select b.rnum, b.rid before_rowid, a.rowid after_rowid
  2  from order_items1 a, order_items2 b
  3  where a.rnum(+) = b.rnum
  4  and (a.rowid != b.rid
  5    or a.rowid is null)
  6  order by b.rnum
  7  ;

           RNUM BEFORE_ROWID       AFTER_ROWID
--------------- ------------------ ------------------
              1 AAAWSGAAAAAO1aTAAA AAAWSGAAAAAO1aeAAA
            100 AAAWSGAAAAAO1aTABj AAAWSGAAAAAO1aeAAB
           1000 AAAWSGAAAAAO1aTAPn AAAWSGAAAAAO1aeAAC
          10000 AAAWSGAAAAAO1aXBEv AAAWSGAAAAAO1aeAAD

--
-- EHCC DELETE example - no rows migrate
--

SQL> create table order_items1
  2  compress for query high
  3  as
  4  select rownum as rnum, x.*
  5  from order_items x
  6  where rownum <= 10000;

Table created.

SQL> create table order_items2
  2  as
  3  select rowid as rid, x.*
  4  from order_items1 x;

Table created.

SQL> delete from order_items1
  2  where rnum in (1,100,1000,10000);

4 rows deleted.

SQL> commit;

Commit complete.

SQL> select b.rnum, b.rid before_rowid, a.rowid after_rowid
  2  from order_items1 a, order_items2 b
  3  where a.rnum(+) = b.rnum
  4  and (a.rowid != b.rid
  5    or a.rowid is null)
  6  order by b.rnum
  7  ;

           RNUM BEFORE_ROWID       AFTER_ROWID
--------------- ------------------ ------------------
              1 AAAWSIAAAAAO1aTAAA
            100 AAAWSIAAAAAO1aTABj
           1000 AAAWSIAAAAAO1aTAPn
          10000 AAAWSIAAAAAO1aXBEv

Oracle Exadata Database Machine Offerings: X2-2 and X2-8

For those who followed or attended Oracle OpenWorld last week you may have seen the introduction of the new hardware for the Oracle Exadata Database Machine. Here’s a high level summary of what was introduced:

  • Updated Exadata Storage Server nodes (based on the Sun Fire X4270 M2)
  • Updated 2 socket 12 core database nodes for the X2-2 (based on the Sun Fire X4170 M2)
  • New offering of 8 socket 64 core database nodes using the Intel 7500 Series (Nehalem-EX) processors for the X2-8 (based on the Sun Fire X4800)

The major updates in the X2-2 compared to V2 database nodes are:

  • CPUs updated from quad-core Intel 5500 Series (Nehalem-EP) processors to six-core Intel 5600 Series (Westmere-EP)
  • Network updated from 1 GbE to 10 GbE
  • RAM updated from 72 GB to 96 GB

The updates to the Exadata Storage Servers (which are identical for both the X2-2 and X2-8 configurations) are:

  • CPUs updated to the six-core Intel 5600 Series (Westmere-EP) processors
  • 600 GB 15k RPM SAS offering now known as HP (High Performance)
  • 2 TB  7.2k RPM SAS offering now known as HC (High Capacity) [previously the 2 TB drives were 7.2k RPM SATA]

One of the big advantages of the CPU updates to the Intel 5600 Series (Westmere-EP) processors is that the Oracle Database Transparent Data Encryption can leverage the Intel Advanced Encryption Standard New Instructions (Intel AES-NI) found in the Intel Integrated Performance Primitives (Intel IPP).  This “in silicon” functionality results in a 10x increase in encryption and an 8x increase in decryption using 256 bit keys per the Oracle Press release.

The differences (as I quickly see) between the X2-2 and the X2-8 offerings are:

  • X2-8 only comes in full racks (of 2 database nodes)
  • X2-8 has 2 TB of RAM per rack (compared to 768 GB for the X2-2)
  • X2-8 has 16s/128c/256t per rack vs. 16s/96c/192t for the X2-2 (s=sockets, c=cores, t=threads)

One of the other Exadata related announcements was that Solaris x86 will be an option for the database OS in addition to Linux.

In summary, the Oracle Exadata Database Machine is riding the wave of Intel processors and is leveraging the Intel IPP functionality and will likely do so for the foreseeable future.

If you want all hardware details, check out the product data sheets:

I just noticed that Alex Gorbachev has a nice table format of the hardware components for your viewing pleasure as well.

Oracle Exadata and Netezza TwinFin Compared – An Engineer’s Analysis

There seems to be little debate that Oracle’s launch of the Oracle Exadata Storage Server and the Sun Oracle Database Machine has created buzz in the database marketplace. Apparently there is so much buzz and excitement around these products that two competing vendors, Teradata and Netezza, have both authored publications that contain a significant amount of discussion about the Oracle Database with Real Application Clusters (RAC) and Oracle Exadata. Both of these vendor papers are well structured but make no mistake, these are marketing publications written with the intent to be critical of Exadata and discuss how their product is potentially better. Hence, both of these papers are obviously biased to support their purpose.

My intent with this blog post is simply to discuss some of the claims, analyze them for factual accuracy, and briefly comment on them. After all, Netezza clearly states in their publication:

The information shared in this paper is made available in the spirit of openness. Any inaccuracies result from our mistakes, not an intent to mislead.

In the interest of full disclosure, my employer is Oracle Corporation, however, this is a personal blog and what I write here are my own ideas and words (see disclaimer on the right column). For those of you who don’t know, I’m a database performance engineer with the Real-World Performance Group which is part of Server Technologies. I’ve been working with Exadata since before it was launched publicly and have worked on dozens of data warehouse proofs-of-concept (PoCs) running on the Exadata powered Sun Oracle Database Machine. My thoughts and comments are presented purely from an engineer’s standpoint.

The following writings are the basis of my discussion:

  1. Teradata: Exadata – the Sequel: Exadata V2 is Still Oracle
  2. Daniel Abadi: Defending Oracle Exadata
  3. Netezza: Oracle Exadata and Netezza TwinFin™ Compared

If you have not read Daniel Abadi’s blog post I strongly suggest you do before proceeding further. I think it is very well written and is presented from a vendor neutral point of view so there is no marketing gobbledygook to sort through. Several of the points in the Teradata writing which he discusses are also presented (or similarly presented) in the Netezza eBook, so you can relate his responses to those arguments as well. Since I feel Daniel Abadi did an excellent job pointing out the major flaws with the Teradata paper, I’m going to limit my discuss to the Netezza eBook.

Understanding Exadata Smart Scan

As a prerequisite for the discussion of the Netezza and Teradata papers, it’s imperative that we take a minute to understand the basics of Exadata Smart Scan. The Smart Scan optimizations include the following:

  • Data Elimination via Storage Indexes
  • Restriction/Row Filtering/Predicate Filtering
  • Projection/Column Filtering
  • Join Processing/Filtering via Bloom Filters and Bloom Pruning

The premise of these optimizations is reduce query processing times in the following ways:

  • I/O Elimination – don’t read data off storage that is not needed
  • Payload Reduction – don’t send data to the Oracle Database Servers that is not needed

OK. Now that you have a basic understanding, let’s dive into the claims…

Netezza’s Claims

Let’s discuss a few of Netezza claims against Exadata:

Claim: Exadata Smart Scan does not work with index-organized tables or clustered tables.

While this is a true statement, its intent is clearly to mislead you. Both of these structures are designed for OLTP workloads, not data warehousing. In fact, if one were to actually read the Oracle Database 11.2 documentation for index-organized tables you would find the following (source):

Index-organized tables are ideal for OLTP applications, which require fast primary key access

If one were to research table clusters you would find the Oracle Database 11.2 documentation offers the following guidelines (source):

Typically, clustering tables is not appropriate in the following situations:

  • The tables are frequently updated.
  • The tables frequently require a full table scan.
  • The tables require truncating.

As anyone can see from the Oracle Database 11.2 Documentation, neither of these structures are appropriate for data warehousing.

Apparently this was not what Netezza really wanted you to know so they uncovered a note on IOTs from almost a decade ago, dating back to 2001 – Oracle 9i time frame, that while it clearly states:

[an IOT] enables extremely fast access to table data for primary key based [OLTP] queries

it also suggests that an IOT may be used as a fact table. Clearly this information is quite old and outdated and should probably be removed. What was a recommendation for Oracle Database 9i Release 1 in 2001 is not necessarily a recommendation for Oracle Database 11g Release 2 in 2010. Technology changes so using the most recent recommendations as a basis for discussion is appropriate, not some old, outdated stuff from nearly 10 years ago. Besides, the Oracle Database Machine runs version 11g Release 2, not 9i Release 1.

Bottom line: I’d say this “limitation” has an impact on a nice round number of Exadata data warehouse customers – exactly zero (zero literally being a round number). IOTs and clustered tables are both structures optimized for fast primary key access, like the type of access in OLTP workloads, not data warehousing. The argument that Smart Scan does not work for these structures is really no argument at all.

Claim: Exadata Smart Scan does not work with the TIMESTAMP datatype.

Phil Francisco seems to have left out some very important context in making this accusation, because this is not at all what the cited blog post by Christian Antognini discusses. This post clearly states the discussion is about:

What happens [with Smart Scan] when predicates contain functions or expressions?

Nowhere at all does that post make an isolated reference that Smart Scan does not work with the TIMESTAMP datatype. What this blog post does state is this:

when a TIMESTAMP datatype is involved [with datetime functions], offloading almost never happens

While the Netezza paper references what the blog post author has written, some very important context has been omitted. In doing so, Netezza has taken a specific reference and turned it into a misleading generalization.

The reality is that Smart Scan does indeed work for the TIMESTAMP datatype and here is a basic example to demonstrate such:

SQL> describe t
 Name           Null?    Type
 -------------- -------- ------------------
 ID             NOT NULL NUMBER
 N                       NUMBER
 BF                      BINARY_FLOAT
 BD                      BINARY_DOUBLE
 D                       DATE
 T                       TIMESTAMP(6)
 S                       VARCHAR2(4000)    

SQL> SELECT * FROM t WHERE t = to_timestamp('01-01-2010','DD-MM-YYYY');

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

----------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |     1 |    52 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| T    |     1 |    52 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("T"=TIMESTAMP' 2010-01-01 00:00:00.000000000')
       filter("T"=TIMESTAMP' 2010-01-01 00:00:00.000000000')

You can see that the Smart Scan offload is taking place by the presence of the storage clause (highlighted) in the Predicate Information section above. What Christian Antognini did observe is bug 9682721 and the bugfix resolves the datetime function offload issues for all but a couple scenarios (which he blogs about here) and those operations can (and usually are) expressed differently. For example, an expression using ADD_MONTHS() can easily be expressed using BETWEEN.

Bottom line: Exadata Smart Scan does work with the TIMESTAMP datatype.

Claim: When transactions (insert, update, delete) are operating against the data warehouse concurrent with query activity, smart scans are disabled. Dirty buffers turn off smart scan.

Yet again, Netezza presents only a half-truth. While it is true that an active transaction disables Smart Scan, they fail to further clarify that Smart Scan is only disabled for those blocks that contain an active transaction – the rest of the blocks are able to be Smart Scanned. The amount of data that is impacted by insert, update, delete will generally be a very small fraction of the total data in a data warehouse. Also, data that is inserted via direct path operations is not subject to MVCC (the method Oracle uses for read consistency) as the blocks that are used are new blocks so no read consistent view is needed.

Bottom line: While this claim is partially true, it clearly attempts to overstate the impact of this scenario in a very negative way. Not having Smart Scan for small number of blocks will have a negligible impact on performance.

Also see Daniel Abadi: Exadata does NOT Support Active Data Warehousing

Claim: Using [a shared-disk] architecture for a data warehouse platform raises concern that contention for the shared resource imposes limits on the amount of data the database can process and the number of queries it can run concurrently.

It is unclear what resource Netezza is referring to here, it simply states “the shared resource”. You know the one? Yes, that one… Perhaps they mean the disks themselves, but that is completely unknown. Anyway…

Exadata uses at least a 4 MB Automatic Storage Management (ASM) allocation unit (AU) [more on ASM basics]. This means that there is at least 4 MB of contiguous physical data laid out on the HDD which translates into 4 MB of contiguous data streamed off of disk for full table scans before the head needs to perform a seek. With such large I/O requests the HDDs are able to spend nearly all the time transferring data, and very little time finding it and that is what matters most. Clearly if Exadata is able to stream data off of disk at 125 MB/s per disk (near physics speed for this type of workload) then any alleged “contention” is really not an issue. In many multi-user data warehouse workloads for PoCs, I’ve observed that each Exadata Storage Server is able to perform very close or at the data sheet physical HDD I/O rate of 1500 MB/s per server.

Bottom line: The scalability differences between shared-nothing and shared-disk are very much exaggerated. By doing large sequential I/Os the disk spends its time returning data, not finding it. Simply put – there really is no “contention”.

Also see Daniel Abadi: 1) Exadata does NOT Enable High Concurrency & 2) Exadata is NOT Intelligent Storage; Exadata is NOT Shared-Nothing

Claim: Analytical queries, such as “find all shopping baskets sold last month in Washington State, Oregon and California containing product X with product Y and with a total value more than $35″ must retrieve much larger data sets, all of which must be moved from storage to database.

I find it so ironic that Netezza mentions this type of query as nearly an identical (but more complex) one was used by my group at Oracle OpenWorld 2009 in our The Terabyte Hour with the Real-World Performance Group session. The exact analytical query we ran live for the audience to demonstrate the features of Oracle Exadata and the Oracle Database Machine was, “What were the most popular items in the baskets of shoppers who visited stores in California in the first week of May and did not buy bananas?”

Let’s translate the Netezza analytical question into some tables and SQL to see what the general shape of this query may look like:

select
   count(*)  -- using count(*) for simplicity of the example
from (
   select
      td.transaction_id,
      sum(td.sales_dollar_amt) total_sales_amt,
      sum(case when p.product_description in ('Brand #42 beer') then 1 else 0 end) count_productX,
      sum(case when p.product_description in ('Brand #42 frozen pizza') then 1 else 0 end) count_productY
   from transaction_detail td
      join d_store s   on (td.store_key = s.store_key)
      join d_product p on (td.product_key = p.product_key)
   where
      s.store_state in ('CA','OR','WA') and
      td.transaction_time >= timestamp '2010-07-01 00:00:00' and
      td.transaction_time <  timestamp '2010-08-01 00:00:00'
   group by td.transaction_id
) x
where
   total_sales_amt > 35 and
   count_productX > 0 and
   count_productY > 0

To me, this isn’t a particularly complex analytical question/query. As written, it’s just a 3 table join (could be 4 if I added a D_DATE I suppose), but it doesn’t require anything fancy – just a simple GROUP BY with a CASE in the SELECT to count how many times Product X and Product Y appear in a given basket.

Netezza claims that analytical queries like this one must move all the data from storage to the database, but that simply is not true. Here is why:

  1. Simple range partitioning on the event timestamp (a very common data warehousing practice for those databases that support partitioning), or even Exadata Storage Indexes, will eliminate any I/O for data other than the one month window that is required for this query.
  2. A bloom filter can be created and pushed into Exadata to be used as a storage filter for the list of STORE_KEY values that represent the three state store restriction.

Applying both of #1 and #2, the only data that is returned to the database for the fact table are rows for stores in Washington State, Oregon and California for last month. Clearly this is only a subset of the data for the entire fact table.

This is just one example, but there are obviously different representations of the same data and query that could be used. I chose what I thought was the most raw, unprocessed, uncooked form simply because Netezza seems to boast about brute force type of operations. Even then, considering a worst case scenario, Exadata does not have to move all the data back to the database. Other data/table designs that I’ve seen from customers in the retail business would allow even less data to be returned.

Bottom line: There are numerous ways that Exadata can restrict the data that is set to the database servers and it’s likely that any query with any predicate restrictions can do so. Certainly it is possible even with the analytic question that Netezza mentions.

Claim: To evenly distribute data across Exadata’s grid of storage servers requires administrators trained and experienced in designing, managing and maintaining complex partitions, files, tablespaces, indices, tables and block/extent sizes.

Interestingly enough, the author of the Teradata paper seems to have a better grasp than Netezza on how data distribution and ASM work describing it on page 9:

Distribution of data on Exadata storage is managed by Oracle’s Automatic Storage Manager (ASM). By default, ASM stripes each Oracle data partition across all available disks on every Exadata cell.

So if by default ASM evenly stripes data across all available disks on Exadata Storage Server (and it does, in a round robin manner) what exactly is so difficult here? What training and experience is really required for something that does data distribution automatically? I can only assert that Phil Francisco has not even read the Teradata paper (but it would seem he has since he even mentions it on his blog), let alone Introduction to Oracle Automatic Storage Management. It’s claims like this that really make me question how genuine his “no intent to mislead” statement really is.

Bottom line: Administrators need not worry about data distribution with Exadata and ASM – it is done automatically and evenly for you.

Conclusion

I’m always extremely reluctant to believe much of what vendors say about other vendors, especially when they preface their publication with something like: “One caveat: Netezza has no direct access to an Exadata machine“, and “Any inaccuracies result from our mistakes, not an intent to mislead” yet they still feel qualified enough to write about said technology and claim it as fact. I also find it interesting that both Teradata and Netezza have published anti-Exadata papers, but neither Netezza nor Teradata have published anti-vendor papers about each other (that I know of). Perhaps Exadata is much more of a competitor than either of them let on. They do protest too much, methinks.

The list of claims I’ve discussed certainly is not an exhaustive list by any means but I think it is fairly representative of the quality found in Netezza’s paper. While sometimes the facts are correct, the arguments are overstated and misleading. Other times, the facts are simply wrong. Netezza clearly attempts to create the illusion of problems simply where they do not exist.

Hopefully this blog post has left you a more knowledgeable person when it comes to Oracle and Exadata. I’ve provided fact and example wherever possible and kept assertions to a minimum.

I’d like to end with a quote from Daniel Abadi’s response to the Teradata paper which I find more than applicable to the Netezza paper as well:

Many of the claims and inferences made in the paper about Exadata are overstated, and the reader needs to be careful not to be mislead into believing in the existence problems that don’t actually present themselves on realistic data sets and workloads.

Courteous and professional comments are welcome. Anonymous comments are discouraged. Snark and flame will end up in the recycle bin. Comment accordingly.

The Core Performance Fundamentals Of Oracle Data Warehousing – Set Processing vs Row Processing

[back to Introduction]

In over six years of doing data warehouse POCs and benchmarks for clients there is one area that I frequently see as problematic: “batch jobs”.  Most of the time these “batch jobs” take the form of some PL/SQL procedures and packages that generally perform some data load, transformation, processing or something similar.  The reason these are so problematic is that developers have hard-coded “slow” into them.  I’m generally certain these developers didn’t know they had done this when they coded their PL/SQL, but none the less it happened.

So How Did “Slow” Get Hard-Coded Into My PL/SQL?

Generally “slow” gets hard-coded into PL/SQL because the PL/SQL developer(s) took the business requirements and did a “literal translation” of each rule/requirement one at a time instead of looking at the “before picture” and the “after picture” and determining the most efficient way to make those data changes.  Many times this can surface as cursor based row-by-row processing, but it also can appear as PL/SQL just running a series of often poorly thought out SQL commands.

Hard-Coded Slow Case Study

The following is based on a true story. Only the facts names have been changed to protect the innocent.

Here is a pseudo code snippet based on a portion of some data processing I saw in a POC:

{truncate all intermediate tables}
insert into temp1 select * from t1 where create_date = yesterday;
insert into temp1 select * from t2 where create_date = yesterday;
insert into temp1 select * from t3 where create_date = yesterday;
insert into temp2 select * from temp1 where {some conditions};
insert into target_table select * from temp2;
for each of 20 columns
loop
  update target_table t
    set t.column_name =
      (select column_name
       from t4
       where t.id=t4.id )
    where i.column_name is null;
end loop
update target_table t set {list of 50 columns} = select {50 columns} from t5 where t.id=t5.id;

I’m going to stop there as any more of this will likely make you cry more than you already should be.

I almost hesitate to ask the question, but isn’t it quite obvious what is broken about this processing?  Here’s the major inefficiencies as I see them:

  • What is the point of inserting all the data into temp1, only then to filter some of it out when temp2 is populated.  If you haven’t heard the phrase “filter early” you have some homework to do.
  • Why publish into the target_table and then perform 20 single column updates, followed by a single 50 column update?  Better question yet: Why perform any bulk updates at all?  Bulk updates (and deletes) are simply evil – avoid them at all costs.

So, as with many clients that come in and do an Exadata Database Machine POC, they really weren’t motivated to make any changes to their existing code, they just wanted to see how much performance the Exadata platform would give them.  Much to their happiness, this reduced their processing time from over 2.5 days (weekend job that started Friday PM but didn’t finish by Monday AM) down to 10 hours, a savings of over 2 days (24 hours).  Now, it could fail and they would have time to re-run it before the business opened on Monday morning.  Heck, I guess if I got back 24 hours out of 38 I’d be excited too, if I were not a database performance engineer who knew there was even more performance left on the table, waiting to be exploited.

Feeling unsatisfied, I took it upon myself to demonstrate the significant payback that re-engineering can yield on the Exadata platform and I coded up an entirely new set-based data flow in just a handful of SQL statements (no PL/SQL).  The result: processing an entire week’s worth of data (several 100s of millions of rows) now took just 12 minutes.  That’s right — 7 days worth of events scrubbed, transformed, enriched and published in just 12 minutes.

When I gently broke the news to this client that it was possible to load the week’s events in just 12 minutes they were quite excited (to say the least).  In fact, one person even said (a bit out of turn), “well, that would mean that a single day’s events could be loaded in just a couple minutes and that would give a new level of freshness to the data which would allow the business to make faster, better decisions due to the timeliness of the data.”  My response: “BINGO!”  This client now had the epiphany of what is now possible with Exadata where previously it was impossible.

It’s Not a Need, It’s a Want

I’m not going to give away my database engineer hat for a product marketing hat just yet (or probably ever), but this is the reality that exists.  IT shops started with small data sets and use small data set programming logic on their data, and that worked for some time.  The reason: because inefficient processing on a small data set is only a little inefficient, but the same processing logic on a big data set is very inefficient.  This is why I have said before: In oder to fully exploit the Oracle Exadata platform (or any current day platform) some re-engineering may be required. Do not be mistaken — I am not saying you need to re-engineer your applications for Exadata.  I am saying you will want to re-engineer your applications for Exadata as those applications simply were not designed to leverage the massively parallel processing that Exadata allows one to do.  It’s time to base design decisions based on today’s technology, not what was available when your application was designed.  Fast forward to now.

Oracle OpenWorld 2010: The Oracle Real-World Performance Group

Now that Oracle OpenWorld 2010 is just under 70 days away I thought I would take a moment to mention that the Oracle Real-World Performance Group will again be hosting three sessions.   This year I think we have a very exciting and informative lineup of sessions that are a must-attend for those wanting to see and hear Oracle Database performance insight right from Oracle’s own performance engineers.  Hope to see you there!

And for those who are interested, there will likely be many discussions about the Oracle Database Machine and Oracle Exadata.  Very hot stuff!

Session ID: S317164 (Monday 2:00PM)
Session Title: The Latest Real World Performance Challenges
Session Abstract: Oracle’s Real-World Performance Group — the group that first presented at Oracle OpenWorld parallel query techniques with partitions, the index-less database, cardinality challenges with the optimizer, over-processed databases and connection storms — this year presents the performance issues before you experience them and how to plan for future projects with success. All topics discussed in this session come from the Real-World Performance Group’s observations and problem solving.
Session ID: S317166 (Monday 5:00PM)
Session Title: Real-World Performance Panel Session
Session Abstract: This session is your chance, via written questions, to ask a panel stacked full of real-world performance talent all those questions to which you’ve just wanted to get a simple answer. You can write your questions on postcards available in the meeting room. Please focus on performance topics and not system debugging! 
Session ID: S317165 (Tuesday 2:00PM)
Session Title: Oracle Database Performance Secrets Finally Revealed
Session Abstract: Have you ever seen a real-world database performance engineer solve Oracle Database performance problems? Wouldn’t you like to know all the performance secrets they know? In this session, real-world database performance engineers will go over many of the performance secrets they use to get extreme performance out of Oracle Database. Not only will they tell you about these secrets but they will demo them for you as well. This session is specifically for those wanting to advance their database performance knowledge and experience.