A reader recently left a comment for which my reply was longer than I’d like to leave for a comment so I’m answering it in detail with this blog post.
Nice article. I am just reading the Netezza paper.
You don’t appear to have debunked the following statement.
“Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.”
Not many queries exist where data is only required from two tables. Are Oracle suggesting we need to change the way data is structured to enable best use of Exadata – increasing TCO significantly?
Thanks & Nice post.
There is a reason that I did not debunk that statement – it did not exist in the original version of Netezza’s paper. It seems they have taken the shopping basket example that I debunked in my previous post and replaced it with this one. Nonetheless, let’s take a look at Netezza’s claim:
Exadata’s storage tier provides Bloom filters to implement simple joins between one large and one smaller table, anything more complex cannot be processed in MPP. Analytical queries commonly require joins more complex than those supported by Exadata. Consider the straightforward case of an international retailer needing insight to the dollar value of sales made in stores located in the UK. This simple SQL query requires a join across three tables – sales, currency and stores.

select sum(sales_value * exchange_rate) us_dollar_sales
from sales, currency, stores
where sales.day = currency.day
and stores.country = 'UK'
and currency.country = 'USA'
Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.
Before I comment, did you spot the error with the SQL query? Hint: Count the number of tables and joins.
Now we can clearly see that Netezza marketing cannot write good SQL: this query contains a cross product because there is no JOIN between sales and stores. Thus the value returned from this query is not “the [US] dollar value of sales made in stores located in the UK”, it’s some other rubbish number.
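To make the cross product concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The table layouts and the handful of rows are my own made-up assumptions (they are not from Netezza's paper), chosen only to show how the missing sales-to-stores join changes the answer.

```python
import sqlite3

# Tiny in-memory stand-in for the three tables; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sales    (store_id integer, day text, sales_value real);
    create table currency (day text, country text, exchange_rate real);
    create table stores   (store_id integer, country text);

    insert into stores   values (1, 'UK'), (2, 'UK'), (3, 'FR');
    insert into currency values ('2010-01-01', 'USA', 1.5);
    insert into sales    values (1, '2010-01-01', 100.0),
                                (3, '2010-01-01', 40.0);
""")

# The query as published: no join predicate between sales and stores.
netezza_sql = """
    select sum(sales_value * exchange_rate)
    from sales, currency, stores
    where sales.day = currency.day
      and stores.country = 'UK'
      and currency.country = 'USA'
"""

# Same query with the missing join predicate added.
corrected_sql = netezza_sql + " and sales.store_id = stores.store_id"

as_published = conn.execute(netezza_sql).fetchone()[0]
with_join = conn.execute(corrected_sql).fetchone()[0]

print("as published:", as_published)
print("with join:   ", with_join)
```

In this toy data set the published query counts every sale once per UK store, including the sale made in France, so its total has nothing to do with UK sales.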
Netezza is trying to lead you to believe that sending data to the database nodes (running Oracle RAC) is a bad thing, which it most certainly is not. Let’s remember what Exadata is – Smart Storage. Exadata itself is not an MPP database, so of course it needs to send some data back to the Oracle database nodes where the Oracle database kernel can use Parallel Execution to easily parallelize the execution of this query in an MPP fashion, efficiently leveraging all the CPUs and memory of the database cluster.
The reality here is that both Netezza and Oracle will do the JOIN in their respective databases, however, Oracle can push a Bloom filter into Exadata for the STORES.COUNTRY predicate so that the only data that is returned to the Oracle database are rows matching that criteria.
Let’s assume for a moment that the query is correctly written with two joins and the table definitions look like such (at least the columns we’re interested in):
create table sales (
  store_id number,
  day date,
  sales_value number
);

create table currency (
  day date,
  country varchar2(3),
  exchange_rate number
);

create table stores (
  store_id number,
  country varchar2(3)
);

select sum(sales.sales_value * currency.exchange_rate) us_dollar_sales
from sales, currency, stores
where sales.day = currency.day
and sales.store_id = stores.store_id
and stores.country = 'UK'
and currency.country = 'USA'
For discussion’s sake, let’s assume the following:
- There is 1 year (365 days) in the SALES table of billions of rows
- There are 5000 stores in the UK (seems like a realistic number to me)
There is no magic in those numbers, it’s just something to add context to the discussion, so don’t think I picked them for some special reason. Could be more, could be less, but it really doesn’t matter.
So if we think about the cardinality for the tables:
- STORES has a cardinality of 5000 rows
- CURRENCY has a cardinality of 365 rows (1 year)
The table JOIN order should be STORES -> SALES -> CURRENCY.
With Exadata what will happen is such:
- Get STORE_IDs from STORES where COUNTRY = ‘UK’
- Build a Bloom Filter of these 5000 STORE_IDs and push them into Exadata
- Scan SALES and apply the Bloom Filter in storage, returning only rows for UK STORE_IDs and projecting only the necessary columns
- JOIN that result to CURRENCY
- Compute the SUM aggregate
All of these operations are performed in parallel using Oracle’s Parallel Execution.
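The steps above can be sketched in a few lines of Python. This is only an illustration of the general Bloom filter join technique, not Oracle's implementation: the BloomFilter class, its hash scheme, its sizes and the sample rows are all assumptions of mine. The point to notice is that the filter can return false positives but never false negatives, so the database tier still performs the exact join on the (much smaller) set of rows that survive the scan.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; bit count and hash scheme are illustrative only."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Hypothetical sample data.
stores = {1: "UK", 2: "UK", 3: "FR", 4: "US"}            # store_id -> country
usd_rate = {"2010-01-01": 1.5, "2010-01-02": 2.0}        # day -> rate for 'USA'
sales = [                                                # (store_id, day, sales_value)
    (1, "2010-01-01", 100.0),
    (2, "2010-01-02", 50.0),
    (3, "2010-01-01", 40.0),
    (4, "2010-01-02", 70.0),
]

# Step 1: get the STORE_IDs for UK stores.
uk_ids = {sid for sid, country in stores.items() if country == "UK"}

# Step 2: build the Bloom filter that would be pushed to storage.
bloom = BloomFilter()
for sid in uk_ids:
    bloom.add(sid)

# Step 3: "storage" scan of SALES, applying the filter and projecting columns.
returned = [(sid, day, value) for sid, day, value in sales
            if bloom.might_contain(sid)]

# Steps 4 and 5: exact join in the "database" tier, then the SUM aggregate.
us_dollar_sales = sum(value * usd_rate[day]
                      for sid, day, value in returned
                      if sid in uk_ids and day in usd_rate)

print(len(returned), "of", len(sales), "SALES rows returned from the scan")
print("us_dollar_sales =", us_dollar_sales)
```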
Netezza suggests that Exadata can use Bloom filters for only two table joins (1 big, 1 small) and that analytical queries are more complex than that, so Exadata cannot use a Bloom filter, and provides an example to suggest such. The reality is that not only is their example incorrectly written SQL, it also works great with Exadata Bloom filters and it is more than 2 tables! In addition, it is a great demonstration of efficient and smart data movement, as Exadata can smartly filter using Bloom filters and needs to project only a very few columns, thus likely creating a big savings versus sending all the columns/rows from the storage. Thus Exadata Bloom filters can work with complex analytical queries of more than two tables and efficiently send data across the network to the Oracle RAC cluster where Parallel Execution will work on the JOINs and aggregation in an MPP manner.
Now to specifically answer your question: No, Oracle is not suggesting you need to change your data/queries to support two table joins, Exadata will likely work fine with what you have today. And to let you and everyone else in on a little secret: Exadata actually supports applying multiple Bloom filters to a table scan (we call this a Bloom filter list denoted by the Predicate Information of a query plan by SYS_OP_BLOOM_FILTER_LIST), so you can have multiple JOIN filters being applied in the Exadata storage, so in reality Bloom filters are not even limited to just 2 table JOINs.
Oh well, so much for Netezza competitive marketing. Just goes to show that Netezza has a very poor understanding of how Exadata really works (yet again).
There seems to be little debate that Oracle’s launch of the Oracle Exadata Storage Server and the Sun Oracle Database Machine has created buzz in the database marketplace. Apparently there is so much buzz and excitement around these products that two competing vendors, Teradata and Netezza, have both authored publications that contain a significant amount of discussion about the Oracle Database with Real Application Clusters (RAC) and Oracle Exadata. Both of these vendor papers are well structured but make no mistake, these are marketing publications written with the intent to be critical of Exadata and discuss how their product is potentially better. Hence, both of these papers are obviously biased to support their purpose.
My intent with this blog post is simply to discuss some of the claims, analyze them for factual accuracy, and briefly comment on them. After all, Netezza clearly states in their publication:
The information shared in this paper is made available in the spirit of openness. Any inaccuracies result from our mistakes, not an intent to mislead.
In the interest of full disclosure, my employer is Oracle Corporation, however, this is a personal blog and what I write here are my own ideas and words (see disclaimer on the right column). For those of you who don’t know, I’m a database performance engineer with the Real-World Performance Group which is part of Server Technologies. I’ve been working with Exadata since before it was launched publicly and have worked on dozens of data warehouse proofs-of-concept (PoCs) running on the Exadata powered Sun Oracle Database Machine. My thoughts and comments are presented purely from an engineer’s standpoint.
The following writings are the basis of my discussion:
- Teradata: Exadata – the Sequel: Exadata V2 is Still Oracle
- Daniel Abadi: Defending Oracle Exadata
- Netezza: Oracle Exadata and Netezza TwinFin™ Compared
If you have not read Daniel Abadi’s blog post I strongly suggest you do before proceeding further. I think it is very well written and is presented from a vendor neutral point of view so there is no marketing gobbledygook to sort through. Several of the points in the Teradata writing which he discusses are also presented (or similarly presented) in the Netezza eBook, so you can relate his responses to those arguments as well. Since I feel Daniel Abadi did an excellent job pointing out the major flaws with the Teradata paper, I’m going to limit my discussion to the Netezza eBook.
Understanding Exadata Smart Scan
As a prerequisite for the discussion of the Netezza and Teradata papers, it’s imperative that we take a minute to understand the basics of Exadata Smart Scan. The Smart Scan optimizations include the following:
- Data Elimination via Storage Indexes
- Restriction/Row Filtering/Predicate Filtering
- Projection/Column Filtering
- Join Processing/Filtering via Bloom Filters and Bloom Pruning
The premise of these optimizations is to reduce query processing times in the following ways:
- I/O Elimination – don’t read data off storage that is not needed
- Payload Reduction – don’t send data to the Oracle Database Servers that is not needed
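Here is a rough sketch of those two ideas in Python. It is a simplified model, not how the Exadata software is written: the Region class, the smart_scan function and the sample rows are hypothetical. Each region keeps a min/max summary of one column (in the spirit of a storage index), so regions that cannot contain matching rows are never read, and only the requested columns of matching rows are returned.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A chunk of stored rows plus a min/max summary of the 'day' column."""
    rows: list

    def __post_init__(self):
        days = [row["day"] for row in self.rows]
        self.min_day, self.max_day = min(days), max(days)


def smart_scan(regions, lo, hi, columns):
    """Return (projected matching rows, number of regions actually read)."""
    regions_read = 0
    result = []
    for region in regions:
        # I/O elimination: skip regions whose min/max cannot match.
        if region.max_day < lo or region.min_day > hi:
            continue
        regions_read += 1
        for row in region.rows:
            if lo <= row["day"] <= hi:  # row (predicate) filtering
                # Payload reduction: send back only the needed columns.
                result.append(tuple(row[c] for c in columns))
    return result, regions_read


def make_rows(first_day, last_day):
    # Hypothetical rows; 'comment' stands in for wide columns nobody asked for.
    return [{"day": d, "store_id": d % 3, "sales_value": float(d),
             "comment": "x" * 100}
            for d in range(first_day, last_day + 1)]


regions = [Region(make_rows(1, 10)), Region(make_rows(11, 20)),
           Region(make_rows(21, 30))]

result, regions_read = smart_scan(regions, lo=12, hi=15,
                                  columns=("store_id", "sales_value"))
print(regions_read, "of", len(regions), "regions read;",
      len(result), "rows returned")
```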
OK. Now that you have a basic understanding, let’s dive into the claims…
Let’s discuss a few of Netezza’s claims against Exadata:
Claim: Exadata Smart Scan does not work with index-organized tables or clustered tables.
While this is a true statement, its intent is clearly to mislead you. Both of these structures are designed for OLTP workloads, not data warehousing. In fact, if one were to actually read the Oracle Database 11.2 documentation for index-organized tables you would find the following (source):
Index-organized tables are ideal for OLTP applications, which require fast primary key access
If one were to research table clusters you would find the Oracle Database 11.2 documentation offers the following guidelines (source):
Typically, clustering tables is not appropriate in the following situations:
- The tables are frequently updated.
- The tables frequently require a full table scan.
- The tables require truncating.
As anyone can see from the Oracle Database 11.2 Documentation, neither of these structures is appropriate for data warehousing.
Apparently this was not what Netezza really wanted you to know so they uncovered a note on IOTs from almost a decade ago, dating back to 2001 – Oracle 9i time frame, that while it clearly states:
[an IOT] enables extremely fast access to table data for primary key based [OLTP] queries
it also suggests that an IOT may be used as a fact table. Clearly this information is quite old and outdated and should probably be removed. What was a recommendation for Oracle Database 9i Release 1 in 2001 is not necessarily a recommendation for Oracle Database 11g Release 2 in 2010. Technology changes so using the most recent recommendations as a basis for discussion is appropriate, not some old, outdated stuff from nearly 10 years ago. Besides, the Oracle Database Machine runs version 11g Release 2, not 9i Release 1.
Bottom line: I’d say this “limitation” has an impact on a nice round number of Exadata data warehouse customers – exactly zero (zero literally being a round number). IOTs and clustered tables are both structures optimized for fast primary key access, like the type of access in OLTP workloads, not data warehousing. The argument that Smart Scan does not work for these structures is really no argument at all.
Claim: Exadata Smart Scan does not work with the TIMESTAMP datatype.
Phil Francisco seems to have left out some very important context in making this accusation, because this is not at all what the cited blog post by Christian Antognini discusses. This post clearly states the discussion is about:
What happens [with Smart Scan] when predicates contain functions or expressions?
Nowhere at all does that post make an isolated reference that Smart Scan does not work with the TIMESTAMP datatype. What this blog post does state is this:
when a TIMESTAMP datatype is involved [with datetime functions], offloading almost never happens
While the Netezza paper references what the blog post author has written, some very important context has been omitted. In doing so, Netezza has taken a specific reference and turned it into a misleading generalization.
The reality is that Smart Scan does indeed work for the TIMESTAMP datatype and here is a basic example to demonstrate such:
SQL> describe t
 Name           Null?    Type
 -------------- -------- ------------------
 ID             NOT NULL NUMBER
 N                       NUMBER
 BF                      BINARY_FLOAT
 BD                      BINARY_DOUBLE
 D                       DATE
 T                       TIMESTAMP(6)
 S                       VARCHAR2(4000)

SQL> SELECT * FROM t WHERE t = to_timestamp('01-01-2010','DD-MM-YYYY');

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

----------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |     1 |    52 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| T    |     1 |    52 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("T"=TIMESTAMP' 2010-01-01 00:00:00.000000000')
       filter("T"=TIMESTAMP' 2010-01-01 00:00:00.000000000')
You can see that the Smart Scan offload is taking place by the presence of the storage clause in the Predicate Information section above. What Christian Antognini did observe is bug 9682721, and the bugfix resolves the datetime function offload issues for all but a couple of scenarios (which he blogs about here); those operations can be (and usually are) expressed differently. For example, an expression using ADD_MONTHS() can easily be expressed using BETWEEN.
Bottom line: Exadata Smart Scan does work with the TIMESTAMP datatype.
Claim: When transactions (insert, update, delete) are operating against the data warehouse concurrent with query activity, smart scans are disabled. Dirty buffers turn off smart scan.
Yet again, Netezza presents only a half-truth. While it is true that an active transaction disables Smart Scan, they fail to further clarify that Smart Scan is only disabled for those blocks that contain an active transaction – the rest of the blocks are able to be Smart Scanned. The amount of data that is impacted by insert, update, delete will generally be a very small fraction of the total data in a data warehouse. Also, data that is inserted via direct path operations is not subject to MVCC (the method Oracle uses for read consistency) as the blocks that are used are new blocks so no read consistent view is needed.
Bottom line: While this claim is partially true, it clearly attempts to overstate the impact of this scenario in a very negative way. Not having Smart Scan for a small number of blocks will have a negligible impact on performance.
Also see Daniel Abadi: Exadata does NOT Support Active Data Warehousing
Claim: Using [a shared-disk] architecture for a data warehouse platform raises concern that contention for the shared resource imposes limits on the amount of data the database can process and the number of queries it can run concurrently.
It is unclear what resource Netezza is referring to here, it simply states “the shared resource”. You know the one? Yes, that one… Perhaps they mean the disks themselves, but that is completely unknown. Anyway…
Exadata uses at least a 4 MB Automatic Storage Management (ASM) allocation unit (AU) [more on ASM basics]. This means that there is at least 4 MB of contiguous physical data laid out on the HDD which translates into 4 MB of contiguous data streamed off of disk for full table scans before the head needs to perform a seek. With such large I/O requests the HDDs are able to spend nearly all the time transferring data, and very little time finding it and that is what matters most. Clearly if Exadata is able to stream data off of disk at 125 MB/s per disk (near physics speed for this type of workload) then any alleged “contention” is really not an issue. In many multi-user data warehouse workloads for PoCs, I’ve observed that each Exadata Storage Server is able to perform very close or at the data sheet physical HDD I/O rate of 1500 MB/s per server.
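A quick back-of-the-envelope calculation shows why the I/O size matters so much. The 4 MB allocation unit, 125 MB/s per disk and 1500 MB/s per server figures come from the text above. The 8 ms positioning time, the 8 KB comparison I/O and the use of 125 MB/s as the media rate are my own illustrative assumptions, and 12 is the number of disks in one Exadata Storage Server. The trend is the point here, not the exact values.

```python
def transfer_fraction(io_size_mb, media_rate_mb_s=125.0, positioning_ms=8.0):
    """Fraction of each I/O spent transferring data rather than positioning.

    positioning_ms (seek plus rotational latency) is an assumed figure.
    """
    transfer_ms = io_size_mb / media_rate_mb_s * 1000.0
    return transfer_ms / (transfer_ms + positioning_ms)


small_io = transfer_fraction(8 / 1024)   # one 8 KB random read
large_io = transfer_fraction(4.0)        # one 4 MB allocation unit

DISKS_PER_SERVER = 12
server_rate_mb_s = DISKS_PER_SERVER * 125.0

print(f"8 KB I/O: {small_io:.1%} of the time transferring data")
print(f"4 MB I/O: {large_io:.1%} of the time transferring data")
print(f"{DISKS_PER_SERVER} disks x 125 MB/s = {server_rate_mb_s:.0f} MB/s per server")
```

Even with a full positioning delay charged to every single 4 MB request, the disk spends most of its time transferring data, whereas small random reads spend almost all of it positioning the head.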
Bottom line: The scalability differences between shared-nothing and shared-disk are very much exaggerated. By doing large sequential I/Os the disk spends its time returning data, not finding it. Simply put – there really is no “contention”.
Also see Daniel Abadi: 1) Exadata does NOT Enable High Concurrency & 2) Exadata is NOT Intelligent Storage; Exadata is NOT Shared-Nothing
Claim: Analytical queries, such as “find all shopping baskets sold last month in Washington State, Oregon and California containing product X with product Y and with a total value more than $35” must retrieve much larger data sets, all of which must be moved from storage to database.
I find it so ironic that Netezza mentions this type of query as nearly an identical (but more complex) one was used by my group at Oracle OpenWorld 2009 in our The Terabyte Hour with the Real-World Performance Group session. The exact analytical query we ran live for the audience to demonstrate the features of Oracle Exadata and the Oracle Database Machine was, “What were the most popular items in the baskets of shoppers who visited stores in California in the first week of May and did not buy bananas?”
Let’s translate the Netezza analytical question into some tables and SQL to see what the general shape of this query may look like:
select count(*) -- using count(*) for simplicity of the example
from (
  select
    td.transaction_id,
    sum(td.sales_dollar_amt) total_sales_amt,
    sum(case when p.product_description in ('Brand #42 beer') then 1 else 0 end) count_productX,
    sum(case when p.product_description in ('Brand #42 frozen pizza') then 1 else 0 end) count_productY
  from transaction_detail td
    join d_store s on (td.store_key = s.store_key)
    join d_product p on (td.product_key = p.product_key)
  where s.store_state in ('CA','OR','WA')
  and td.transaction_time >= timestamp '2010-07-01 00:00:00'
  and td.transaction_time < timestamp '2010-08-01 00:00:00'
  group by td.transaction_id
) x
where total_sales_amt > 35
and count_productX > 0
and count_productY > 0
To me, this isn’t a particularly complex analytical question/query. As written, it’s just a 3 table join (could be 4 if I added a D_DATE I suppose), but it doesn’t require anything fancy – just a simple GROUP BY with a CASE in the SELECT to count how many times Product X and Product Y appear in a given basket.
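If you want to see the shape of that query in action, here is a small sketch that runs it with Python's built-in sqlite3 module. It is not Oracle: the timestamp literals are replaced with plain strings (an adaptation for SQLite), and the five sample baskets are made up so that exactly one of them satisfies every condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table d_store   (store_key integer, store_state text);
    create table d_product (product_key integer, product_description text);
    create table transaction_detail (
        transaction_id integer, store_key integer, product_key integer,
        sales_dollar_amt real, transaction_time text);
""")

# Hypothetical dimension and fact rows.
conn.executemany("insert into d_store values (?, ?)", [(1, "CA"), (2, "NY")])
conn.executemany("insert into d_product values (?, ?)", [
    (10, "Brand #42 beer"), (11, "Brand #42 frozen pizza"), (12, "Bananas")])
conn.executemany("insert into transaction_detail values (?, ?, ?, ?, ?)", [
    (1, 1, 10, 20.0, "2010-07-05 10:00:00"),  # basket 1: qualifies
    (1, 1, 11, 20.0, "2010-07-05 10:00:00"),
    (2, 1, 10, 30.0, "2010-07-06 11:00:00"),  # basket 2: no product Y
    (2, 1, 12, 10.0, "2010-07-06 11:00:00"),
    (3, 2, 10, 20.0, "2010-07-07 12:00:00"),  # basket 3: wrong state
    (3, 2, 11, 20.0, "2010-07-07 12:00:00"),
    (4, 1, 10, 20.0, "2010-06-15 09:00:00"),  # basket 4: wrong month
    (4, 1, 11, 20.0, "2010-06-15 09:00:00"),
    (5, 1, 10, 10.0, "2010-07-08 13:00:00"),  # basket 5: total not over $35
    (5, 1, 11, 10.0, "2010-07-08 13:00:00"),
])

basket_sql = """
    select count(*)
    from (
      select
        td.transaction_id,
        sum(td.sales_dollar_amt) total_sales_amt,
        sum(case when p.product_description in ('Brand #42 beer')
                 then 1 else 0 end) count_productX,
        sum(case when p.product_description in ('Brand #42 frozen pizza')
                 then 1 else 0 end) count_productY
      from transaction_detail td
        join d_store s on (td.store_key = s.store_key)
        join d_product p on (td.product_key = p.product_key)
      where s.store_state in ('CA','OR','WA')
        and td.transaction_time >= '2010-07-01 00:00:00'
        and td.transaction_time <  '2010-08-01 00:00:00'
      group by td.transaction_id
    ) x
    where total_sales_amt > 35
      and count_productX > 0
      and count_productY > 0
"""

qualifying_baskets = conn.execute(basket_sql).fetchone()[0]
print("qualifying baskets:", qualifying_baskets)
```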
Netezza claims that analytical queries like this one must move all the data from storage to the database, but that simply is not true. Here is why:
- Simple range partitioning on the event timestamp (a very common data warehousing practice for those databases that support partitioning), or even Exadata Storage Indexes, will eliminate any I/O for data other than the one month window that is required for this query.
- A bloom filter can be created and pushed into Exadata to be used as a storage filter for the list of STORE_KEY values that represent the three state store restriction.
Applying both of the above, the only data that is returned to the database for the fact table are rows for stores in Washington State, Oregon and California for last month. Clearly this is only a subset of the data for the entire fact table.
This is just one example, but there are obviously different representations of the same data and query that could be used. I chose what I thought was the most raw, unprocessed, uncooked form simply because Netezza seems to boast about brute force type of operations. Even then, considering a worst case scenario, Exadata does not have to move all the data back to the database. Other data/table designs that I’ve seen from customers in the retail business would allow even less data to be returned.
Bottom line: There are numerous ways that Exadata can restrict the data that is sent to the database servers and it’s likely that any query with any predicate restrictions can do so. Certainly it is possible even with the analytic question that Netezza mentions.
Claim: To evenly distribute data across Exadata’s grid of storage servers requires administrators trained and experienced in designing, managing and maintaining complex partitions, files, tablespaces, indices, tables and block/extent sizes.
Interestingly enough, the author of the Teradata paper seems to have a better grasp than Netezza on how data distribution and ASM work describing it on page 9:
Distribution of data on Exadata storage is managed by Oracle’s Automatic Storage Manager (ASM). By default, ASM stripes each Oracle data partition across all available disks on every Exadata cell.
So if by default ASM evenly stripes data across all available disks on Exadata Storage Server (and it does, in a round robin manner) what exactly is so difficult here? What training and experience is really required for something that does data distribution automatically? I can only assert that Phil Francisco has not even read the Teradata paper (but it would seem he has since he even mentions it on his blog), let alone Introduction to Oracle Automatic Storage Management. It’s claims like this that really make me question how genuine his “no intent to mislead” statement really is.
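To illustrate why round-robin striping needs no hand tuning, here is a deliberately simplified Python sketch. Real ASM is more involved (mirroring and failure groups, for example); the stripe_round_robin function, the disk names and the three-cell configuration are hypothetical and only model the basic idea of handing out extents to disks in turn.

```python
from collections import Counter
from itertools import cycle


def stripe_round_robin(num_extents, disks):
    """Assign each extent number to a disk in simple round-robin order."""
    placement = {}
    disk_cycle = cycle(disks)
    for extent in range(num_extents):
        placement[extent] = next(disk_cycle)
    return placement


# Hypothetical configuration: 3 storage cells with 12 disks each.
disks = [f"cell{c}_disk{d}" for c in range(3) for d in range(12)]

placement = stripe_round_robin(3600, disks)
per_disk = Counter(placement.values())

print(len(per_disk), "disks used")
print("extents per disk: min", min(per_disk.values()),
      "max", max(per_disk.values()))
```

Every disk ends up with the same number of extents without anyone choosing a distribution key or designing a partitioning scheme for that purpose.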
Bottom line: Administrators need not worry about data distribution with Exadata and ASM – it is done automatically and evenly for you.
I’m always extremely reluctant to believe much of what vendors say about other vendors, especially when they preface their publication with something like: “One caveat: Netezza has no direct access to an Exadata machine“, and “Any inaccuracies result from our mistakes, not an intent to mislead” yet they still feel qualified enough to write about said technology and claim it as fact. I also find it interesting that both Teradata and Netezza have published anti-Exadata papers, but neither Netezza nor Teradata have published anti-vendor papers about each other (that I know of). Perhaps Exadata is much more of a competitor than either of them let on. They do protest too much, methinks.
The list of claims I’ve discussed certainly is not an exhaustive list by any means but I think it is fairly representative of the quality found in Netezza’s paper. While sometimes the facts are correct, the arguments are overstated and misleading. Other times, the facts are simply wrong. Netezza clearly attempts to create the illusion of problems simply where they do not exist.
Hopefully this blog post has left you a more knowledgeable person when it comes to Oracle and Exadata. I’ve provided fact and example wherever possible and kept assertions to a minimum.
I’d like to end with a quote from Daniel Abadi’s response to the Teradata paper which I find more than applicable to the Netezza paper as well:
Many of the claims and inferences made in the paper about Exadata are overstated, and the reader needs to be careful not to be mislead into believing in the existence problems that don’t actually present themselves on realistic data sets and workloads.
Courteous and professional comments are welcome. Anonymous comments are discouraged. Snark and flame will end up in the recycle bin. Comment accordingly.
Oracle Corporation had its F4Q09 earnings call today and the Exadata comments started right away with the earnings press release:
“The Exadata Database Machine is well on its way to being the most successful new product launch in Oracle’s 30 year history,” said Oracle CEO Larry Ellison. “Several of Teradata’s largest customers are performance testing — then buying — Oracle Exadata Database Machines. In a recent competitive benchmark, a Teradata machine took over six hours to process a query that our Exadata Database Machine ran in less than 30 minutes. They bought Exadata.”
During the earnings call Larry Ellison discusses Exadata and the competition:
…I’m going to talk about Exadata again. I said last quarter that Exadata is shaping up to be our most exciting and successful new product introduction in Oracle’s 30 year history and [in the] last quarter Exadata continues to grow and win competitive deals in the marketplace against our three primarily competitors. It’s turning out that Teradata is our number one competitor…Netezza and IBM are kind of tied for second.
Ellison describes some of the Exadata sales from this quarter which include:
- A well-known California SmartPhone and computer manufacturer (win vs. Netezza) who commented that Exadata ran about 100 times faster in some cases than their standard Oracle environment
- Research in Motion
- A large East Coast insurance company
- Thomson Reuters
- A Japanese telco (biggest Teradata customer in Japan) who benchmarked Exadata and found it to be dramatically faster than Teradata
- Barclays Capital (UK)
- A number of banks in Western Europe and Germany
Larry Ellison follows with:
It was just a great quarter for Exadata, a product that is relatively new to the marketplace that is persuading people to move from their existing environments because Exadata is faster and the hardware costs less.
In the Q&A Larry Ellison responds to John DiFucci on Exadata:
By the way every customer I mentioned and alluded to were actual sales. Now some of these, because the Exadata product is so new, quite often will install in kind of a try and buy situation, but I can’t think of a case where we installed the machine that they didn’t buy. So we’re winning these benchmarks. Sometimes we’re beating Teradata. I think in my quote, I said we’ve beat Teradata on one of the queries by 20 to one. So we think it’s a brand new technology, we think we’re a lot faster than the competition. The benchmarks are proving out with real customer data, we’re proving to be much faster than the competition. Every single deal I mentioned were cases where the customer bought the system. There are obviously other evaluations going on and we expect the Exadata sales to accelerate.
A few weeks ago I read Curt Monash’s report on interpreting the results of data warehouse proofs-of-concept (POCs) and I have to say, I’m quite surprised that this topic hasn’t been covered more by analysts in the data warehousing space. I understand that analysts are not database performance engineers, but where do they think that the performance claims of 10x to 100x or more come from? Do they actually investigate these claims or just report on them? I can not say that I have ever seen any database analyst offer any technical insight into these boasts of performance. If some exist be sure to leave a comment and point me to them.
Oracle Exadata Performance Architect Kevin Closson has blogged about a 485x performance increase of Oracle Exadata vs. Oracle Exadata and his follow-up post to explain exactly where the 485x performance gain comes from gave me the nudge to finish this post that had been sitting in my drafts folder since I first read Curt’s post.
Customer Benchmarketing Claims
I thought I would compile a list of what the marketing folks at other database vendors are saying about the performance of their products. Each of these statements has been taken from the given vendor’s website.
- Netezza: 10-100 times faster than traditional solutions…but it is not uncommon to see performance differences as large as 200x to even 400x or more when compared to existing Oracle systems
- Greenplum: often 10 to 100 times faster than traditional solutions
- DATAllegro: 10-100x performance over traditional platforms
- Vertica: Performs 30x-200x faster than other solutions
- ParAccel: 20X – 200X performance gains
- EXASolution: can perform up to 100 times faster than with traditional databases
- Kognitio WX2: Tests have shown to out-perform other database / data warehouse solutions by 10-60 times
Certainly seems these vendors are positioning themselves against traditional database solutions, whatever that means. And differences as large as 400x against Oracle? What is it exactly they are comparing?
Investigative Research On Netezza’s Performance Claims
Using my favorite Internet search engine I came across this presentation by Netezza dated October 2007. On slide 21 Netezza is comparing an NPS 8150 (112 SPU, up to 4.5 TB of user data) server to IBM DB2 UDB on a p680 with 12 CPUs (the existing solution). Not being extremely familiar with the IBM hardware mentioned, I thought I’d research to see exactly what an IBM p680 server consists of. The first link in my search results took me to here where the web page states:
The IBM eServer pSeries 680 has been withdrawn from the market, effective March 28, 2003.
Searching a bit more I came across this page which states that the 12 CPUs in the pSeries 680 are RS64 IV microprocessors. According to Wikipedia the “RS64-IV or Sstar was introduced in 2000 at 600 MHz, later increased to 750 MHz”. Given that at best, the p680 had 12 CPUs running at 750 MHz and the NPS 8150 had 112 440GX PowerPC processors I would give the compute advantage to Netezza by a significant margin. I guess it is cool to brag how your most current hardware beat up on some old used and abused server who has already been served its end-of-life notice. I found it especially intriguing that Netezza is boasting about beating out an IBM p680 server that has been end-of-lifed more than four years prior to the presentation’s date. Perhaps they don’t have any more recent bragging to do?
Going back one slide to #20 you will notice a comparison of Netezza and Oracle. Netezza clearly states they used a NPS 8250 (224 SPUs, up to 9 TB of user data) against Oracle 10g RAC running on Sun/EMC. Well ok…Sun/EMC what??? Obviously there were at least 2 Sun servers, since Oracle 10g RAC is involved, but they don’t mention the server models at all, nor the storage, nor the storage connectivity to the hosts. Was this two or more Sun Netra X1s or what??? Netezza boasts a 449x improvement in a “direct comparison on one day’s worth of data”. What exactly is being compared is up to the imagination. I guess this could be one query or many queries, but the marketeers intentionally fail to mention. They don’t even mention the data set size being compared. Given that Netezza can read data off the 224 drives at 60-70 MB/s, the NPS 8250 has a total scan rate of over 13 GB/s. I can tell you first hand that there are very few Sun/EMC solutions that are configured to support 13 GB/s of I/O bandwidth. Most configurations of that vintage probably don’t support 1/10th of that I/O bandwidth (1.3 GB/s).
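The scan rate arithmetic in that paragraph is easy to check. The sketch below uses only the figures quoted above (224 drives at 60-70 MB/s each) and treats 1 GB/s as 1000 MB/s.

```python
DRIVES = 224                       # one drive per SPU in the NPS 8250
LOW_MB_S, HIGH_MB_S = 60, 70       # per-drive scan rate range quoted above

scan_low_gb_s = DRIVES * LOW_MB_S / 1000
scan_high_gb_s = DRIVES * HIGH_MB_S / 1000
one_tenth_gb_s = scan_low_gb_s / 10

print(f"aggregate scan rate: {scan_low_gb_s:.2f} to {scan_high_gb_s:.2f} GB/s")
print(f"one tenth of the low end: {one_tenth_gb_s:.2f} GB/s")
```

So "over 13 GB/s" is the low end of the range, and a system with a tenth of that bandwidth scans at roughly 1.3 GB/s, which puts it at a 10x disadvantage on I/O alone before any software differences come into play.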
Here are a few more comparisons that I have seen in Netezza presentations:
- NPS 8100 (112 SPUs/4.5 TB max) vs. SAS on Sun E5500/6 CPUs/6GB RAM
- NPS 8100 (112 SPUs/4.5 TB max) vs. Oracle 8i on Sun E6500/12 CPUs/8 GB RAM
- NPS 8400 (448 SPUs/18 TB max) vs. Oracle on Sun (exact hardware not mentioned)
- NPS 8100 (112 SPUs/4.5 TB max) vs. IBM SP2 (database not mentioned)
- NPS 8150z (112 SPUs/5.5 TB max) vs. Oracle 9i on Sun/8 CPUs
- NPS 8250z (224 SPUs/11 TB max) vs. Oracle 9i on Sun/8 CPUs
As you can see, Netezza has a way of finding the oldest hardware around and then comparing it to its latest, greatest NPS. Just like Netezza’s slogan, [The Power to ]Question Everything™, I suggest you question these benchmarketing reports. Database software is only as capable as the hardware it runs on and when Netezza targets the worst performing and oldest systems out there, they are bound to get some good marketing numbers. If they compete against the latest, greatest database software running on the latest, greatest hardware, sized competitively for the NPS being used, the results are drastically different. I can vouch for that one first hand having done several POCs against Netezza.
One Benchmarketing Claim To Rule Them All
Now, one of my favorite benchmarketing reports is one from Vertica. Michael Stonebraker’s blog post on customer benchmarks contains the following table:
Take a good look at the Query 2 results. Vertica takes a query that runs for 4.5 hours (16,200 seconds) in the current row store and runs it in 1 second, for a performance gain of 16,200x. Great googly moogly batman, that is reaching ludicrous speed. Heck, who needs 100x or 400x when you do 16,200x. That surely warrants an explanation of the techniques involved there. It’s much, much more than simply column store vs. row store. It does raise the question (at least to me): why doesn’t Vertica run every query in 1 second? I mean, come on, why doesn’t that 19 minute row store query score better than a 30x gain? Obviously there is a bit of the magic pixie dust going on here with what I would refer to as “creative solutions” (in reality it is likely just a very well designed projection/materialized view, but showing the query and telling us how it was possible would make it less unimpressive [sic]).
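The arithmetic behind those two numbers is easy to reproduce. A small Python sketch, using only the figures quoted above (the variable and function names are mine, for illustration only):

```python
def speedup(before_s: float, after_s: float) -> float:
    """Performance gain expressed as old elapsed time over new elapsed time."""
    return before_s / after_s

# Query 2: 4.5 hours in the row store vs. 1 second in Vertica
q2_before_s = 4.5 * 3600
q2_gain = speedup(q2_before_s, 1)

# The 19 minute row store query with its 30x gain
q19_before_s = 19 * 60
q19_after_s = q19_before_s / 30

print(f"Query 2: {q2_before_s:.0f} s -> 1 s, gain {q2_gain:.0f}x")
print(f"19 minute query at 30x: {q19_after_s:.0f} s")
```

So the 19 minute query still takes about 38 seconds after its 30x gain, which is what makes the 1 second result for Query 2 stand out so much.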
What Is Really Going On Here
First of all, you will notice that not one of these benchmarketing claims is against a vendor run system. Each and every one of these claims is against an existing customer system. The main reason for this is that most vendors' license agreements prohibit publishing benchmark results without prior consent from the vendor. It seems the creative types have found that taking the numbers from the existing production system is not prohibited in the license agreement, so they compare that to their latest, greatest hardware/software and execute or supervise the execution of a benchmark on their solution. Obviously this is a one sided apples to bicycles comparison, but quite favorable for bragging rights for the new guy.
I’ve been doing customer benchmarks and proof of concepts (POCs) for almost 5 years at Oracle. I can guarantee you that Netezza has never even come close to getting 10x-100x the performance over Oracle running on a competitive hardware platform. Now I can say that it is not uncommon for Oracle running on a balanced system to perform 10x to 1000x (ok, in extreme cases) over an existing poorly performing Oracle system. All it takes is a very unbalanced system with no I/O bandwidth, no parallel query, no compression, and poor or no use of partitioning, and you have created a springboard for any vendor to look good.
One More Juicy Marketing Tidbit
While searching the Internet for creative marketing reports I have to admit that the crew at ParAccel probably takes the cake (and not in an impressive way). On one of their web pages they have these bullet points (plus a few more uninteresting ones):
- All operations are done in parallel (A non-parallel DBMS must scan all of the data sequentially)
- Adaptive compression makes disks faster…
Ok, so I can kinda, sorta see the point that a non-parallel DBMS must do something sequentially…not sure how else it would do it, but then again, I don’t know any enterprise database that is not capable of parallel operations. However, I’m going to need a bit of help on the second point there…how exactly does compression make disks faster? Disks are disks. Whether or not compression is involved has nothing to do with how fast a disk is. Perhaps they mean that compression can increase the logical read rate from a disk given that compression allows more data to be stored in the same “space” on the disk, but that clearly is not what they have written. Reminds me of DATAllegro’s faster-than-wirespeed claims on scan performance. Perhaps these marketing guys should have their numbers and wording validated by some engineers.
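To make the distinction concrete, here is a rough Python sketch of the charitable reading. The 70 MB/s physical rate and the 3x compression ratio are made-up numbers for illustration, not figures from ParAccel, and the sketch ignores the CPU cost of decompression.

```python
def logical_read_rate_mb_s(physical_mb_s: float, compression_ratio: float) -> float:
    """Uncompressed (logical) MB delivered per second of physical reading."""
    return physical_mb_s * compression_ratio

physical_mb_s = 70.0      # hypothetical disk transfer rate
compression_ratio = 3.0   # hypothetical: 3 MB of user data per 1 MB on disk

logical_mb_s = logical_read_rate_mb_s(physical_mb_s, compression_ratio)

print(f"Physical rate: {physical_mb_s:.0f} MB/s (the disk is no faster)")
print(f"Logical rate:  {logical_mb_s:.0f} MB/s of user data")
```

The disk still moves the same number of bytes per second either way; only the amount of user data packed into those bytes changes.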
Do You Believe In Magic Or Word Games?
Credible performance claims need to be accounted for and explained. Neil Raden from Hired Brains Research offers guidance for evaluating benchmarks and interpreting market messaging in his paper, Questions to Ask a Data Warehouse Appliance Vendor. I think Neil shares the same opinion of these silly benchmarketing claims. Give his paper a read.
I usually stick to the technical stuff but today is strictly about entertainment. One has to laugh when a company (Tesco) is involved in two press releases in two days with two different database vendors and the press releases are a bit conflicting. I’d say this appears to be a case of open mouth, insert foot.
Sit back, relax, and enjoy the show…
On September 9, 2008 Netezza put out a press release entitled “Netezza Bags Tesco”. In this press release Marcel Borlin, Programme Manager at Tesco states:
“We currently have around 25 heavy analytical users running large queries on Netezza. They are analysing transactional discrepancies across millions of items and item movements every day, right through the supply chain; from stores to distribution centres. The specific application is designed to find wastage such as stolen, destroyed, out-of-date or lost items. It is an important function, as it affects the financial position of the Company. Yet, on the Teradata platform we had to allocate time slots for running these analyses, or the system would grind to a halt.”
On September 11, 2008 Tesco put out a press release entitled “Tesco Reconfirms Commitment to Teradata”. In this press release Marcel Borlin, Programme Manager at Tesco states:
“Tesco is satisfied with the consistently high performance delivered by Teradata and particularly its proven ability to run a mixed workload of management information and complex analytics. As part of our ongoing programme to evaluate different technologies and architectures, we have deployed some analytical users on a small Netezza system which has been reported in the press recently. This implementation does not reflect any dissatisfaction with Teradata. Tesco continues to add new applications to the Teradata EDW which has now grown to 60 terabytes. The EDW is providing Management Information Systems for Commercial Reporting, Supply Chain and Stores as well as Tescolink thereby providing information to 8000 people across more than 2000 suppliers.”
I’ll keep my comments at this: I surely would not want to be Marcel Borlin.