Today EMC Greenplum (I guess that is the “official” name since the acquisition) launched its new product offering and, as part of that announcement, published some performance numbers for data loading rates. Let’s examine what’s behind this loading rate number.
Benchmarks and benchmark results are often criticized (sometimes rightfully so) because they tend to be (over) engineered to prove a point and may include optimizations or techniques that would be uncommon in day-to-day operations. I think most everyone knows and agrees with that. In the interest of providing benchmark numbers that are not over-engineered, Greenplum states the following:
Greenplum and the EMC Data Computing Products Division are now producing real world benchmarks. No more obscure tests against formula-one tuning. Instead we would like to present the beginning of what we are calling real-world benchmarks. These benchmarks are designed to reflect the true customer experience and conform to the following guiding principles:
- Test on the system as it leaves the factory, not the laboratory.
- Create data types and schemas that match real-world use cases.
- Consider options beyond raw bulk loading.
I think that list of good intentions is commendable, especially since I fondly remember EMC data sheets that had IOPS rates that were 100% from the array cache. Hopefully those days are behind them.
The Data Loading Rate Claim
As part of Greenplum’s data loading rate claims, there are two papers written up by Massive Data News that contain some details about these data loading benchmarks, one for Internet and Media, and one for Retail. Below is the pertinent information.
|Configuration||Option (Internet and Media)||Option (Retail)|
|Segment servers||16 (standard GP1000 configuration)||16 (standard GP1000 configuration)|
|Table format||Quicklz columnar||Quicklz columnar|
|Mirrors||Yes (two copies of the data)||Yes (two copies of the data)|
|ETL hosts||20, 2 URLs each||20, 2 URLs each|
|Row width||616 bytes/row||666 bytes/row|
The only differences between the two test tables are the number and types of columns and the row width.
|Metric||Results (Internet and Media)||Results (Retail)|
|Rows per Second||5,530,000||4,770,000|
|TB per Hour||11.16||10.4|
|Time to load 1 Billion rows||3.01 Minutes||3.48 Minutes|
|Metric||Value (Internet and Media)||Value (Retail)|
|Total Flat File Size||198 GB||214 GB|
|Data per ETL host||9.9 GB||10.7 GB|
|Time to load 320M rows||0.9632 Minutes||1.1136 Minutes|
When I looked over these metrics the following stood out to me:
- Extrapolation is used to report the time to load 1 billion rows (only 320 million are actually loaded).
- Roughly 60 seconds of loading time is used to extrapolate an hourly loading rate.
- The source file size per ETL host is very small; small enough to fit entirely in the file system cache.
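To make the extrapolation concrete, here is a rough sketch that recomputes the headline metrics from the measured 320 million row runs. The input numbers come from the tables above; the names (`MEASURED`, `derive`) are mine, and reading “TB” as 2**40 bytes is an assumption on my part, since the papers do not say which unit they use.

```python
# Measured values copied from the tables above.
MEASURED = {
    "Internet and Media": {"rows": 320_000_000, "minutes": 0.9632,
                           "row_bytes": 616, "file_gb": 198},
    "Retail": {"rows": 320_000_000, "minutes": 1.1136,
               "row_bytes": 666, "file_gb": 214},
}
ETL_HOSTS = 20


def derive(run):
    """Recompute the headline metrics from one measured run."""
    seconds = run["minutes"] * 60
    rows_per_sec = run["rows"] / seconds
    # Assumption: "TB" is read as 2**40 bytes; the papers do not say.
    tb_per_hour = rows_per_sec * run["row_bytes"] * 3600 / 2**40
    # Linear scale-up from 320 million rows to 1 billion rows.
    minutes_per_billion = run["minutes"] * 1_000_000_000 / run["rows"]
    gb_per_host = run["file_gb"] / ETL_HOSTS
    return {
        "rows_per_sec": rows_per_sec,
        "tb_per_hour": tb_per_hour,
        "minutes_per_billion": minutes_per_billion,
        "gb_per_host": gb_per_host,
    }


for name, run in MEASURED.items():
    print(name, {k: round(v, 2) for k, v in derive(run).items()})
```

The 1 billion row times come out as exactly 3.125 times the measured times (1 billion / 320 million), which is what you would expect from a purely linear scale-up. The rows-per-second and TB-per-hour values land within about one percent of the published figures.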
Now why are these “red flags” to me for a “real-world” benchmark?
- Extrapolation always shows linear rates. If Greenplum wants to present a real-world number to load 1 billion rows, then load at least 1 billion rows. It can’t be that hard, can it?
- Extrapolation of the loading rate is at a factor of ~60x (extrapolating a 1 hour rate from 1 minute of execution). I’d be much more inclined to believe/trust a rate that was only extrapolated 2x or 4x, but 60x is way too much for me.
- If the source files fit entirely into file system cache, no physical I/O needs to be done to stream that data out. It should be fairly obvious that no database can load data faster than the source system can deliver it. A real-world test should at least load more data than the aggregate memory on the ETL nodes, to eliminate the fully cached effect.
- There are 20 ETL nodes feeding 16 Greenplum Segment nodes. Do real-world customers have more ETL nodes than database nodes? Maybe, maybe not.
- No configuration is listed for the ETL nodes.
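As a back-of-the-envelope check on the points above, the sketch below computes the extrapolation factor for each run and the number of rows needed before the source data no longer fits in aggregate ETL memory. Since no ETL configuration is published, the 48 GB per host figure is purely hypothetical, as are the function names; plug in real numbers if they ever become available.

```python
def extrapolation_factor(measured_minutes, target_minutes=60.0):
    """How many times the measured run is stretched to reach the target."""
    return target_minutes / measured_minutes


def min_rows_to_exceed_cache(etl_hosts, ram_gb_per_host, row_bytes):
    """Smallest row count whose flat-file size exceeds aggregate ETL RAM."""
    aggregate_bytes = etl_hosts * ram_gb_per_host * 10**9
    return aggregate_bytes // row_bytes + 1


# HYPOTHETICAL: the papers list no ETL node configuration.
HYPOTHETICAL_RAM_GB = 48

for name, minutes, row_bytes in [
    ("Internet and Media", 0.9632, 616),
    ("Retail", 1.1136, 666),
]:
    factor = extrapolation_factor(minutes)
    rows = min_rows_to_exceed_cache(20, HYPOTHETICAL_RAM_GB, row_bytes)
    print(f"{name}: {factor:.1f}x extrapolation, "
          f"{rows:,} rows to exceed {HYPOTHETICAL_RAM_GB} GB x 20 hosts")
```

With any plausible amount of memory per ETL host, the 9.9 GB and 10.7 GB per-host source files are a small fraction of what would be needed to force physical reads.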
Now don’t get me wrong. I’m not saying that the EMC Greenplum Data Computing Appliance can’t do what is claimed. But the data that supports those claims has significant room for improvement, especially for a company that says it favors open, real-world benchmarks. Hopefully we will see some better-quality real-world benchmarks from these guys in the future.