Today EMC Greenplum (I guess that is the “official” name since the acquisition) launched their new product offering and as part of that announcement they published some performance numbers around data loading rates. Let’s examine what’s behind this loading rate number.
Benchmarks and benchmark results are often criticized (and sometimes rightfully so) because they are frequently (over)engineered to prove a point and may include optimizations or techniques that would be uncommon in day-to-day operations. I think most everyone knows and agrees with that. In the interest of providing benchmark numbers that are not over-engineered, Greenplum states the following:
Greenplum and the EMC Data Computing Products Division are now producing real world benchmarks. No more obscure tests against formula-one tuning. Instead we would like to present the beginning of what we are calling real-world benchmarks. These benchmarks are designed to reflect the true customer experience and conform to the following guiding principles:
- Test on the system as it leaves the factory, not the laboratory.
- Create data types and schemas that match real-world use cases.
- Consider options beyond raw bulk loading.
I think that list of good intentions is commendable, especially since I fondly remember EMC data sheets that had IOPS rates that were 100% from the array cache. Hopefully those days are behind them.
The Data Loading Rate Claim
As part of Greenplum’s data loading rate claims, there are two papers written up by Massive Data News that contain some details about these data loading benchmarks, one for Internet and Media and one for Retail. Below is the pertinent information.
| Option | Internet and Media | Retail |
|---|---|---|
| Segment servers | 16 (standard GP1000 configuration) | 16 (standard GP1000 configuration) |
| Table format | Quicklz columnar | Quicklz columnar |
| Mirrors | Yes (two copies of the data) | Yes (two copies of the data) |
| ETL hosts | 20, 2 URLs each | 20, 2 URLs each |
| Row width | 616 bytes/row | 666 bytes/row |
The only difference between these two tables is the number/types of columns and the row width.
| Results | Internet and Media | Retail |
|---|---|---|
| Rows per second | 5,530,000 | 4,770,000 |
| TB per hour | 11.16 | 10.4 |
| Time to load 1 billion rows | 3.01 minutes | 3.48 minutes |
| Value | Internet and Media | Retail |
|---|---|---|
| Total flat file size | 198 GB | 214 GB |
| Data per ETL host | 9.9 GB | 10.7 GB |
| Time to load 320M rows | 0.9632 minutes | 1.1136 minutes |
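Before digging into the red flags, it is worth noting that the published figures are internally consistent: everything derives from a single short measured run. A minimal sketch of that arithmetic, using only the published Internet and Media numbers (the derivation itself is my reconstruction, not something Greenplum published):

```python
# Reconstructing the published Internet and Media figures from the one
# measured quantity: a ~58-second load of 320 million rows.
rows_per_sec = 5_530_000        # published rows/second
row_bytes    = 616              # published row width
rows_loaded  = 320_000_000      # rows actually loaded
flat_file_gb = 198              # published total flat file size, GB
etl_hosts    = 20               # published ETL host count

measured_seconds = rows_loaded / rows_per_sec              # ~57.9 s of loading
tb_per_hour      = rows_per_sec * row_bytes * 3600 / 2**40 # binary TB/hour
minutes_1b_rows  = 1_000_000_000 / rows_per_sec / 60       # extrapolated
gb_per_host      = flat_file_gb / etl_hosts                # source data per ETL host

print(f"measured run:      {measured_seconds:.1f} s")
print(f"TB per hour:       {tb_per_hour:.2f}")         # published: 11.16
print(f"1B-row load time:  {minutes_1b_rows:.2f} min") # published: 3.01
print(f"data per ETL host: {gb_per_host:.1f} GB")      # published: 9.9
```

The computed values match the published table to within rounding, which supports the observation that the hourly rate and the 1-billion-row figure are extrapolations from roughly one minute of execution rather than independent measurements.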
When I looked over these metrics the following stood out to me:
- Extrapolation is used to report the time to load 1 billion rows (only 320 million are actually loaded).
- Roughly 60 seconds of loading time is used to extrapolate an hourly loading rate.
- The source file size per ETL host is very small; small enough to fit entirely in the file system cache.
Now why are these “red flags” to me for a “real-world” benchmark?
- Extrapolation always shows linear rates. If Greenplum wants to present a real-world number to load 1 billion rows, then load at least 1 billion rows. It can’t be that hard, can it?
- Extrapolation of the loading rate is at a factor of ~60x (extrapolating a 1 hour rate from 1 minute of execution). I’d be much more inclined to believe/trust a rate that was only extrapolated 2x or 4x, but 60x is way too much for me.
- If the source files fit entirely into file system cache, no physical I/O needs to be done to stream that data out. It should be fairly obvious that no database can load data faster than the source system can deliver that data, but at least load more data than aggregate memory on the ETL nodes to eliminate the fully cached effect.
- There are 20 ETL nodes feeding 16 Greenplum Segment nodes. Do real-world customers have more ETL nodes than database nodes? Maybe, maybe not.
- No configuration is listed for the ETL nodes.
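To put numbers on the first and third objections, here is a small sketch using the published figures; the per-host RAM size is purely an assumption for illustration, since (as noted above) no ETL host configuration was published:

```python
# Extrapolation factor: an hourly rate claimed from under a minute of execution.
measured_minutes = 0.9632              # published time to load 320M rows
extrapolation    = 60.0 / measured_minutes
print(f"extrapolation factor: {extrapolation:.0f}x")   # ~62x

# File-system cache check: 9.9 GB of source data per ETL host fits easily in
# RAM on any plausibly sized server (48 GB/host is an assumed figure).
assumed_host_ram_gb = 48
gb_per_host         = 9.9
print("source fits in cache:", gb_per_host < assumed_host_ram_gb)  # True
```

In other words, the hourly loading rate is a roughly 62x extrapolation, and under any reasonable assumption about ETL host memory the source files would be served entirely from cache.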
Now don’t get me wrong. I’m not claiming that the EMC Greenplum Data Computing Appliance can’t do what is claimed. But surely the data that supports those claims has significant room for improvement, especially coming from a company that claims to favor open and real-world benchmarks. Hopefully we’ll see some better-quality real-world benchmarks from these guys in the future.
Back on June 17th WordPress 3.0 “Thelonious” was released and it offered up a handful of new features. Just a few days ago (July 29th) the 3.0.1 release went GA, so I decided it was time to investigate what the new 3.0-ready themes had to offer. After looking through a handful of themes I decided to give the Magazine Basic theme a try for now. It offers a 1024-pixel-wide layout and threaded comments, two of the features I was really looking for. Feel free to share your comments: good, bad or otherwise. Thanks!
Here is a capture of the previous version just in case you don’t recall what it looked like (click for full size).
February 23, 2009 marks the 2nd anniversary of the Structured Data blog. Looking back, I seem to have forgotten to celebrate the first anniversary, so I guess I’ll celebrate twice as much this year to make up for it. The past two years have been fun and I’ve enjoyed sharing my experiences and interacting with the readers of this blog. I do wish I had a bit more time to commit to writing, but it is what it is; being a bit of a perfectionist, I’d rather focus on quality than quantity. The past year has been a fun but busy one, especially with the launch of Oracle Exadata and the HP Oracle Database Machine. I think the rest of 2009 will continue to be busy and exciting, and I hope to have several more Exadata posts.
The Past Year
Looking back over the past year the top visited blog posts have been:
- Choosing An Optimal Stats Gathering Strategy
- Troubleshooting Bad Execution Plans
- Top Ways How Not To Scale Your Data Warehouse
I’m glad to see the first two of those have been frequently visited, because they cover topics that are extremely helpful in dealing with the root causes of bad execution plans, a problem many DBAs face. Contrary to what some “experts” would have you believe, there are no silver bullets or magic involved in query plan tuning. It all comes down to understanding how things work, and I hope those posts give you insight into exactly that.
The Next Year
One of the main reasons I started this blog was to share my work experiences with the Oracle community, in hopes of explaining how to use the Oracle database software optimally and helping DBAs troubleshoot common but often difficult issues. In keeping with that modus operandi, I’d like to offer up the Topic Suggestions page. Feel free to use it to suggest topics you would like to see explained further or discussed.
Thanks to all of you who have read and contributed to this blog and may the next year be a great one.
I decided it was time for a new look for Structured Data, so I’ve updated the theme. I think I have looked at enough posts and all looks well, but if you find something that appears odd, unreadable or otherwise broken, do let me know by posting a comment on this thread and including a link to the problematic post.
One of the main reasons I chose this theme was the JavaScript that links a reply back to the comment it responds to, as well as the option to quote previous comments.
On the down side, Trackbacks are hidden by default and one has to choose to view either Trackbacks or Comments, but not both. I may modify this as time permits. Time came sooner than I had originally anticipated.
Feel free to leave a comment with your thoughts, especially if you think this version of the site lacks something the previous version had.
I usually stick to the technical stuff but today is strictly about entertainment. One has to laugh when a company (Tesco) is involved in two press releases in two days with two different database vendors and the press releases are a bit conflicting. I’d say this appears to be a case of open mouth, insert foot.
Sit back, relax, and enjoy the show…
On September 9, 2008 Netezza put out a press release entitled “Netezza Bags Tesco”. In this press release Marcel Borlin, Programme Manager at Tesco, states:
“We currently have around 25 heavy analytical users running large queries on Netezza. They are analysing transactional discrepancies across millions of items and item movements every day, right through the supply chain; from stores to distribution centres. The specific application is designed to find wastage such as stolen, destroyed, out-of-date or lost items. It is an important function, as it affects the financial position of the Company. Yet, on the Teradata platform we had to allocate time slots for running these analyses, or the system would grind to a halt.”
On September 11, 2008 Tesco put out a press release entitled “Tesco Reconfirms Commitment to Teradata”. In this press release Marcel Borlin, Programme Manager at Tesco, states:
“Tesco is satisfied with the consistently high performance delivered by Teradata and particularly its proven ability to run a mixed workload of management information and complex analytics. As part of our ongoing programme to evaluate different technologies and architectures, we have deployed some analytical users on a small Netezza system which has been reported in the press recently. This implementation does not reflect any dissatisfaction with Teradata. Tesco continues to add new applications to the Teradata EDW which has now grown to 60 terabytes. The EDW is providing Management Information Systems for Commercial Reporting, Supply Chain and Stores as well as Tescolink thereby providing information to 8000 people across more than 2000 suppliers.”
I’ll keep my comments at this: I surely would not want to be Marcel Borlin.
Up until this point the only public information sharing I have done has been via the Real-World Performance Round-table discussions at Oracle OpenWorld and, most recently, at the NoCOUG Winter Conference. Going forward I would like to use this blog as an additional channel for discussions with the Oracle user community.
Just a bit of background on me: I’ve been a member of the Real-World Performance Group at Oracle since 2004, and I wore the DBA and Systems/Storage Admin hats prior to that. My role in the Real-World Performance Group is that of a database performance engineer. I concentrate on data warehousing and spend most of my time doing data warehouse benchmarks for customers, ranging in size from 1TB to more than 70TB. I’ve also worked on OLTP benchmarks as well as performance-related escalations.