A reader recently left a comment for which my reply was longer than I’d like to leave for a comment so I’m answering it in detail with this blog post.
Nice article. I am just reading the Netezza paper.
You don’t appear to have debunked the following statement.
“Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.”
Not many queries exist where data is only required from two tables. Are Oracle suggesting we need to change the way data is structured to enable best use of Exadata – increasing TCO significantly?
Thanks & Nice post.
There is a reason that I did not debunk that statement – it did not exist in the original version of Netezza’s paper. It seems they have taken the shopping basket example that I debunked in my previous post and replaced it with this one. Nonetheless lets take a look at Netezza’s claim:
Exadata’s storage tier provides Bloom filters to implement simple joins between one large and one smaller table, anything more complex cannot be processed in MPP. Analytical queries commonly require joins more complex than those supported by Exadata. Consider the straightforward case of an international retailer needing insight to the dollar value of sales made in stores located in the UK. This simple SQL query requires a join across three tables – sales, currency and stores.select sum(sales_value * exchange_rate) us_dollar_sales from sales, currency, stores where sales.day = currency.day and stores.country = 'UK' and currency.country = 'USA'
Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.
Before I comment, did you spot the error with the SQL query? Hint: Count the number of tables and joins.
Now that we can clearly see that Netezza marketing can not write good SQL because this query contains a cross product as there is no JOIN between sales and stores thus the value returned from this query is not “the [US] dollar value of sales made in stores located in the UK”, it’s some other rubbish number.
Netezza is trying to lead you to believe that sending data to the database nodes (running Oracle RAC) is a bad thing, which is most certainly is not. Let’s remember what Exadata is – Smart Storage. Exadata itself is not an MPP database, so of course it needs to send some data back to the Oracle database nodes where the Oracle database kernel can use Parallel Execution to easily parallelize the execution of this query in an MPP fashion efficiently leveraging all the CPUs and memory of the database cluster.
The reality here is that both Netezza and Oracle will do the JOIN in their respective databases, however, Oracle can push a Bloom filter into Exadata for the STORES.COUNTRY predicate so that the only data that is returned to the Oracle database are rows matching that criteria.
Let’s assume for a moment that the query is correctly written with two joins and the table definitions look like such (at least the columns we’re interested in):
create table sales ( store_id number, day date, sales_value number ); create table currency ( day date, country varchar2(3), exchange_rate number ); create table stores ( store_id number, country varchar2(3) ); select sum(sales.sales_value * currency.exchange_rate) us_dollar_sales from sales, currency, stores where sales.day = currency.day and sales.store_id = stores.store_id and stores.country = 'UK' and currency.country = 'USA'
For discussion’s sake, let’s assume the following:
- There is 1 year (365 days) in the SALES table of billions of rows
- There are 5000 stores in the UK (seems like a realistic number to me)
There is no magic in those numbers, it’s just something to add context to the discussion, so don’t think I picked them for some special reason. Could be more, could be less, but it really doesn’t matter.
So if we think about the the cardinality for the tables:
- STORES has a cardinality of 5000 rows
- CURRENCY has a cardinality of 365 rows (1 year)
The table JOIN order should be STORES -> SALES -> CURRENCY.
With Exadata what will happen is such:
- Get STORE_IDs from STORE where COUNTRY = ‘UK’
- Build a Bloom Filter of these 5000 STORE_IDs and push them into Exadata
- Scan SALES and apply the Bloom Filter in storage, retuning only rows for UK STORE_IDs and project only the necessary columns
- JOIN that result to CURRENCY
- Compute the SUM aggregate
All of these operations are performed in parallel using Oracle’s Parallel Execution.
Netezza suggests that Exadata can use Bloom filters for only two table joins (1 big, 1 small) and that analytical queries are more complex than that so Exadata can not use a Bloom filter and provide an example to suggest such. The reality is not only is their example incorrectly written SQL, it also works great with Exadata Bloom filters and it is more than 2 tables! In addition, it is a great demonstration of efficient and smart data movement as Exadata can smartly filter using Bloom filters and needs to only project a very few columns, thus likely creating a big savings versus sending all the columns/rows from the storage. Thus Exadata Bloom filters can work with complex analytical queries of more than two tables and efficiently send data across the network to the Oracle RAC cluster where Parallel Execution will work on the JOINs and aggregation in an MPP manor.
Now to specifically answer your question: No, Oracle is not suggesting you need to change your data/queries to support two table joins, Exadata will likely work fine with what you have today. And to let you and everyone else in on a little secret: Exadata actually supports applying multiple Bloom filters to a table scan (we call this a Bloom filter list denoted by the Predicate Information of a query plan by SYS_OP_BLOOM_FILTER_LIST), so you can have multiple JOIN filters being applied in the Exadata storage, so in reality Bloom filters are not even limited to just 2 table JOINs.
Oh well, so much for Netezza competitive marketing. Just goes to show that Netezza has a very poor understanding how Exadata really works (yet again).
I tend to keep the content of this blog fairly technical and engineering focused, but every now and then I have to venture off and do an editorial post. Recently some of the ParAccel management decided to fire up the FUD machine on the ParAccel blog and take aim at Oracle’s Exadata making the following claims:
“There are 12 SAS disks in the storage server with a speed of about 75 MB/s [The SUN Oracle Exadata Storage Server datasheet claims 125 MB/s but we think that is far-fetched.]” -Rick Glick, Vice President of Technology and Architecture (link)
“We stand by the 75MB/sec as a conservative, reliable number. We see higher numbers in disk tests, but never anywhere near 125MB/sec.” -Barry Zane, Chief Technology Officer (link)
Far Fetched Or Fact?
As a database performance engineer, I strive to be extremely detailed and well researched with my work. Clearly, these comments from Rick and Barry were not well researched as is evident from information publicly available on the Internet.
The first bit of documentation I would research before making such comments would be the hard disk drive specification sheet. The 12 drives in the Exadata Storage Server, a Sun Fire X4275, are 3.5-inch 15K RPM SAS 2.0 6Gb/sec 600GB drives. Looking at the drive spec sheet, it clearly states that the sustained sequential read is 122 MB/sec (at ID) to 204 MB/sec (at OD) [that's Inner Diameter & Outer Diameter]. Seems to me that Oracle’s claim of 1500MB/s per Exadata Storage Server (125MB/s for each of the 12 SAS drives) is certainly between 122MB/s and 204MB/s.
Now granted, one might think that vendors overstate their performance claims, so it may be resourceful to search the Internet for some third party evaluation of this hard disk. I went to a fairly well known Internet search engine to try find more information using a highly sophisticated and complex set of search keywords. To my astonishment, there at the top of the search results page was a write up by a third party. I would encourage reading the entire article, but if you want to just skip to page 5 [Benchmarks - HD Tune Pro] you will be presented with data that shows the minimum (120MB/s), average (167MB/s) and maximum (200MB/s) read throughput for sequential read tests performed by the author for the hard disk drive in dispute. Looks to me that those numbers are completely in line with the Sun spec sheet – no over exaggeration going on here. At this point there should be exactly zero doubt that the drives themselves, with the proper SAS controller, are easily physically capable of 125MB/s read rates and more.
Stand By Or Sit Down?
As I see it, there are three possibilities:
- Disk vendors are overly optimistic in their continuous sequential read rates.
- The newer class of SAS2 compatible 15Krpm drives and controllers are faster than the older generation we’ve measured.
- Our disk access patterns are not getting all the available performance.
Let’s drill into each of these possibilities:
- Perhaps vendors are overly optimistic, but how overly optimistic could they possibly be? I mean, really, 125MB/s is easily between the spec sheet rates of 122MB/s and 204MB/s. Truly 75MB/s is a low ball number for these drives. Even Exadata V1 SAS drives more than 75MB/s per drive and the HDD is not the limiting factor in the scan throughput (a good understanding of the hardware components should lead you to what is). Even the Western Digital 300GB 10K RPM VelociRaptor disk drive has benchmarks that show a maximum sequential data transfer rate of more than 120 MB/s and sustain a minimum of 75MB/s even on the innermost part of the platter, and that is a SATA drive commonly used in PCs!
- Barry states that ParAccel has no experience nor metrics (measurements) with these drives or seemingly any drives like them, but yet Barry calls “75MB/sec as a conservative, reliable number”. Just how reliable of a number can it possibly be when you have exactly zero data points and zero experience with the HDDs in dispute? Is this a debate that can be won by strength of personality or does it actually require data, numbers and facts?
- Perhaps the ParAccel database has disk access patterns that can not drive the scan rates that Exadata can, but should one assert that because ParAccel database may not drive that IO rate, Exadata can’t, even when said rate is within the realm of physical capability? I certainly would think not. Not unless the intention is simply to promote FUD.
So, as I see it, there are exactly two possibilities: Either one has technical knowledge on what they are talking about (and they have actual data/facts to support it) or they do not and they are just making things up. At this point I think the answer is quite clear in this situation; Rick and Barry had no data to support their (incorrect) assertions.
And The Truth Shall Be Revealed
Three weeks after Barry’s “three possibilities” comment, Barry reveals the real truth:
…we [ParAccel] have gotten a number of newer servers with SAS2 drives…[and] the newer generation of disk drives are faster than my experience…Exadata’s claim of 1500MB/sec per server seems completely reasonable…My apologies for any confusion created.
As it has come to pass, my assertion that ParAccel had absolutely no experience and thus no data to support their claims is validated (not that I really had any doubts). Spreading FUD generally does cause unnecessary confusion, but then again, that is usually the intention. I would expect such nonsense from folks with marketing in their title, but I hold a higher bar for people with technology in their titles. This was a simple debate about physical disk drive characteristics (and not software) and that is something anyone could get concrete factual data on (assuming they actually take the time and effort).
And Isn’t It Ironic… Don’t You Think?
The same day I read Barry’s “truth comment” I also read Jerome Pineau’s blog post on social media marketing. I could not help but recognize (and laugh about) the irony of the situation. Jerome lists several tips on being successful in SMM and the first two really stood out to me:
- Do not profess expertise on topics you know little about. Eventually, it will show.
- Always remain honest. Never lie. Your most important asset is credibility. You can fix almost any mistake except credibility damage.
Truly, truly ironic…