EMC Greenplum Data Computing Appliance Real World Benchmarks

Today EMC Greenplum (I guess that is the “official” name since the acquisition) launched their new product offering, and as part of that announcement they published some performance numbers around data loading rates. Let’s examine what’s behind these loading rate numbers.

Real-World Benchmarks

Benchmarks and benchmark results are often criticized (and sometimes rightfully so) because they are frequently over-engineered to prove a point and may include optimizations or techniques that would be uncommon in day-to-day operations. I think most everyone knows and agrees with that. In the interest of providing benchmark numbers that are not over-engineered, Greenplum states the following [1]:

Greenplum and the EMC Data Computing Products Division are now producing real world benchmarks. No more obscure tests against formula-one tuning. Instead we would like to present the beginning of what we are calling real-world benchmarks. These benchmarks are designed to reflect the true customer experience and conform to the following guiding principles:

  • Test on the system as it leaves the factory, not the laboratory.
  • Create data types and schemas that match real-world use cases.
  • Consider options beyond raw bulk loading.

I think that list of good intentions is commendable, especially since I fondly remember EMC data sheets that had IOPS rates that were 100% from the array cache. Hopefully those days are behind them.

The Data Loading Rate Claim

As part of Greenplum’s data loading rate claims, there are two papers written up by Massive Data News that contain some details about these data loading benchmarks, one for Internet and Media [2], and one for Retail [3]. Below is the pertinent information.

Application Configuration

| Configuration      | Internet and Media                 | Retail                             |
|--------------------|------------------------------------|------------------------------------|
| Segment servers    | 16 (standard GP1000 configuration) | 16 (standard GP1000 configuration) |
| Table format       | QuickLZ columnar                   | QuickLZ columnar                   |
| Mirrors            | Yes (two copies of the data)       | Yes (two copies of the data)       |
| gp_autostats_mode  | None                               | None                               |
| ETL hosts          | 20, 2 URLs each                    | 20, 2 URLs each                    |
| Rows loaded        | 320,000,000                        | 320,000,000                        |
| Row width          | 616 bytes/row                      | 666 bytes/row                      |

The only difference between these two tables is the number and types of columns, and therefore the row width.

Benchmark Results

| Metric                      | Internet and Media | Retail       |
|-----------------------------|--------------------|--------------|
| Rows per second             | 5,530,000          | 4,770,000    |
| TB per hour                 | 11.16              | 10.4         |
| Time to load 1 billion rows | 3.01 minutes       | 3.48 minutes |

Derived Metrics

| Metric                  | Internet and Media | Retail         |
|-------------------------|--------------------|----------------|
| Total flat file size    | 198 GB             | 214 GB         |
| Data per ETL host       | 9.9 GB             | 10.7 GB        |
| Time to load 320M rows  | 0.9632 minutes     | 1.1136 minutes |
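
These derived figures are simple back-of-envelope arithmetic from the published numbers: rows times row width for the flat file size, divided across the 20 ETL hosts, and rows divided by the reported rows-per-second rate for the actual load time. A minimal sketch of that arithmetic, assuming decimal gigabytes since the papers do not say which definition they use:

```python
# Back-of-envelope reproduction of the "Derived Metrics" above, using only
# the figures published in the two papers. Decimal gigabytes (10^9 bytes)
# are assumed.

benchmarks = {
    "Internet and Media": {"rows": 320_000_000, "row_bytes": 616, "rows_per_sec": 5_530_000},
    "Retail":             {"rows": 320_000_000, "row_bytes": 666, "rows_per_sec": 4_770_000},
}

ETL_HOSTS = 20

for name, b in benchmarks.items():
    flat_file_gb = b["rows"] * b["row_bytes"] / 1e9    # total source data
    per_host_gb = flat_file_gb / ETL_HOSTS             # data served by each ETL host
    load_minutes = b["rows"] / b["rows_per_sec"] / 60  # actual measured load time
    print(f"{name}: {flat_file_gb:.0f} GB total, "
          f"{per_host_gb:.1f} GB per ETL host, "
          f"{load_minutes:.2f} minutes to load 320M rows")
```

The small differences from the table above (for example ~197 GB vs. 198 GB) are presumably just rounding in the published rates and row widths.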

When I looked over these metrics, the following stood out to me:

  • Extrapolation is used to report the time to load 1 billion rows (only 320 million are actually loaded).
  • Roughly 60 seconds of loading time is used to extrapolate an hourly loading rate.
  • The source file size per ETL host is very small; small enough to fit entirely in the file system cache.

Now why are these “red flags” to me for a “real-world” benchmark?

  • Extrapolation always shows linear rates. If Greenplum wants to present a real-world number to load 1 billion rows, then load at least 1 billion rows. It can’t be that hard, can it?
  • Extrapolation of the loading rate is at a factor of ~60x (extrapolating a 1-hour rate from about 1 minute of execution; see the quick check after this list). I’d be much more inclined to believe/trust a rate that was only extrapolated 2x or 4x, but 60x is way too much for me.
  • If the source files fit entirely into file system cache, no physical I/O needs to be done to stream that data out. It should be fairly obvious that no database can load data faster than the source system can deliver that data, but at least load more data than aggregate memory on the ETL nodes to eliminate the fully cached effect.
  • There are 20 ETL nodes feeding 16 Greenplum Segment nodes. Do real-world customers have more ETL nodes than database nodes? Maybe, maybe not.
  • No configuration is listed for the ETL nodes.
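
For concreteness, here is a quick numeric check on the extrapolation factor and the file-system-cache point; this is a sketch only, and the 48 GB of RAM per ETL host is a hypothetical figure picked purely for illustration, since no ETL host configuration is published at all:

```python
# Rough numbers behind the extrapolation and file-system-cache concerns.
# The 48 GB of RAM per ETL host is a hypothetical assumption -- the papers
# publish no configuration for the ETL hosts.

measured_minutes = 320_000_000 / 5_530_000 / 60    # actual Internet/Media load time (~0.96 min)
extrapolation_factor = 60 / measured_minutes       # scale-up implied by quoting an hourly rate
print(f"Hourly rate extrapolated by roughly {extrapolation_factor:.0f}x")  # ~62x

per_host_gb = 198 / 20         # source data served by each ETL host (from the table above)
hypothetical_ram_gb = 48       # assumption only; not from the papers
print(f"Per-host source data fits in memory: {per_host_gb < hypothetical_ram_gb}")  # True
```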

Now don’t get me wrong: I’m not challenging whether the EMC Greenplum Data Computing Appliance can do what is claimed. But surely the data that supports those claims has significant room for improvement, especially for a company that is claiming to be in favor of open and real-world benchmarks. Hopefully we will see some better quality real-world benchmarks from these guys in the future.

The Most Impressive Metric

Loading rate aside, I found the most impressive metric was that EMC Greenplum can fit 18 servers in a rack that is just 7.5 inches tall (or is it 190 cm?) [4].

References
[1] http://www.greenplum.com/resources/data-loading-benchmarks/
[2] http://www.greenplum.com/pdf/gpdf/RealWorldBenchmarks_InternetMedia.pdf
[3] http://www.greenplum.com/pdf/gpdf/RealWorldBenchmarks_Retail.pdf
[4] http://www.greenplum.com/pdf/EMC-Greenplum_DCA_DataSheet.pdf

8 comments

  1. Pete Scott

    7.5 inches is an example of the extreme compression you will experience. But as someone 1.8 m tall (which I thought to be 5 ft 11 in), I now feel somewhat small.

  2. ikke

    I don’t see why these 2 points would make the benchmark worthless:

    # There are 20 ETL nodes feeding 16 Greenplum Segment nodes. Do real-world customers have more ETL nodes than database nodes? Maybe, maybe not.
    # No configuration is listed for the ETL nodes.

    It’s a benchmark for the DB, not for ETL. If the DB can handle the data coming from 100 ETL servers, then that’s what you throw at it to show the numbers.
    Benchmarks can never be completely real because no database will only be handling this one load, not because you don’t have this number of ETL servers.

  3. Greg Rahn

    @ikke

    I never said or insinuated it made the benchmark worthless – it’s a matter of the definition and interpretation of what is being promoted as “real-world”.

  4. Mike Maxey

    Hi Greg,
    Mike Maxey here from the EMC Data Computing Division (Greenplum). Thanks for taking the time to evaluate and provide feedback on our real-world benchmark program. We just kicked this off and will continue to refine the details and provide updated results. A couple of comments on your suggestions.

    1. 320 million rows extrapolated to 1 billion. Good call; we are re-running with 1 billion.
    2. Longer run times. We will continue to expand the test matrix.
    3. 20 ETL servers / no configuration details. The goal was to show database load performance and we wanted to be certain that the ETL function was in no way a bottleneck.
    4. Finally, we did not squeeze 144 TB into 7 1/2 inches of rack space. Clearly a typo on the data sheet.

    Thanks again for the feedback. Our goal is transparency and numbers that everyone can believe and achieve.

    MwM

  5. Greg Rahn

    @Mike Maxey

    Thanks for the comments, and obviously my comment on #4 was just some humor (190 cm is a typical 42U rack). On the last page of the Load and Go paper it mentions that the data generators for the benchmarks will be on greenplum.com. Can you point those out? I wasn’t able to spot them. Thanks!

  6. Jim Olsen

    Greg – would be happy to discuss the tests which were run in more detail – it is certainly not the Oracle RWP Retail demo ;) This type of conjecture goes both ways. For example, Exadata doing 1,000,000 IOPS – has this been tested with a real-world benchmark? I would say that the Exadata docs also provide much theoretical information.

  7. Greg Rahn

    @Jim Olsen

    Actually Jim, I just ran a customer’s workload and hit over 1,000,000 IOPS last month. I posted a screen capture of it here. There was no special engineering done to hit that number and we were able to achieve what the Oracle Exadata Database Machine data sheet states (even a bit more, actually), so there is no overstatement there. Also, you know just as well as I do, because you used to work for Oracle, that the Exadata scan rates are achievable with customer workloads. Not one of the numbers in the Exadata data sheet is theoretical – they have all been demonstrated to be achievable and have been demonstrated live at OOW for anyone to observe.

    Do you honestly feel it is an impressive showing for that system to load a whopping 200 GB using 192 cores? If the system can do 10 TB/hour, then take 6 minutes and load 1 TB, or 12 minutes and load 2 TB. That should be easy and would certainly carry more weight than the current 200 GB size. And for the record, Jim, if I saw such a “benchmark” being used for any Exadata numbers, I’d be critical of it as well, because there is no value in simply putting a magic number on a data sheet – it has to be attainable by customers. That seems to be EMC’s desire too. They just need to do a better job with it, and it certainly should not be difficult.

    Also, can you point me to the link for the data generators? They don’t seem to be on the site as the papers state. Thanks!
