Facebook: Hive – A Petabyte Scale Data Warehouse Using Hadoop

Today, June 10th, marks the Yahoo! Hadoop Summit ’09, and the crew at Facebook has a writeup on the Facebook Engineering page entitled Hive – A Petabyte Scale Data Warehouse Using Hadoop.

I found this a very interesting read given some of the Hadoop/MapReduce comments from David J. DeWitt and Michael Stonebraker, as well as their SIGMOD 2009 paper, A Comparison of Approaches to Large-Scale Data Analysis. Now, I’m not about to jump into the whole DBMS-is-better-than-MapReduce argument, but I found Facebook’s storyline interesting:

When we started at Facebook in 2007 all of the data processing infrastructure was built around a data warehouse built using a commercial RDBMS. The data that we were generating was growing very fast – as an example we grew from a 15TB data set in 2007 to a 2PB data set today. The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to process and the situation was just getting worse with every passing day. We had an urgent need for infrastructure that could scale along with our data and it was at that time we then started exploring Hadoop as a way to address our scaling needs.

[The] Hive/Hadoop cluster at Facebook stores more than 2PB of uncompressed data and routinely loads 15 TB of data daily

Wow, 2PB of uncompressed data, growing at around 15TB daily. Part of me wonders how much value there really is in 2PB of data, or whether companies are suffering from OCD when it comes to data. Either way, it’s interesting to see how much data is being generated and collected, and how engineers are dealing with it.
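For anyone who hasn’t looked at Hive: it layers a SQL-like query language (HiveQL) on top of Hadoop, compiling queries into MapReduce jobs over files in HDFS. Here is a minimal sketch of what that looks like in practice (the table, columns, and paths below are hypothetical, made up for illustration rather than taken from Facebook’s writeup):

    -- Hypothetical partitioned table over tab-delimited log files in HDFS.
    CREATE TABLE page_views (
      user_id   BIGINT,
      page_url  STRING,
      view_time STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Load one day's raw logs into the matching partition.
    LOAD DATA INPATH '/logs/2009-06-10'
    INTO TABLE page_views PARTITION (dt = '2009-06-10');

    -- An aggregate query; Hive compiles this into MapReduce jobs.
    SELECT page_url, COUNT(1) AS views
    FROM page_views
    WHERE dt = '2009-06-10'
    GROUP BY page_url
    ORDER BY views DESC
    LIMIT 10;

The appeal over hand-written MapReduce jobs is clear enough: analysts keep a familiar SQL-ish interface while the cluster takes care of parallelism and fault tolerance.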

4 comments

  1. Noons

    Heck, we pump 0.3TB/day into our DW with a commercial DB and 1/10000 of Facebook’s budget.
    I defy anyone to prove in simple terms, not market-speak, what value can be extracted from a 15TB/day DW, and for what purpose!
    But I have no doubt it can be demonstrated: Stonebraker has been at the forefront of impossible projects for so long that nothing surprises me anymore on that front…

  2. Chris Adkin

    I’ve been working with Oracle for over eleven years, and I would not want to give the impression that I’m an RDBMS or Oracle zealot. However, as much as I appreciate that one size does not always fit all, I would dispute the claim that what Facebook has got is a data warehouse, in the sense of something that can be mined, or from which value can be extracted, using BI tools. I would assume that what they have lacks both the tooling and the constructs (things like materialized views, bitmap indexes, etc.; see the sketch after the comments) that would allow this to be done on such a large amount of data. However, if this does what they want, is scalable, and runs off cheap commodity hardware, you can’t knock it too much.

    What would be really interesting would be something that allows petabyte scalability of web content plus BI and data mining. It will be interesting to see what comes along with 11gR2. Within the constraints of not breaching any NDAs, people have told me that there is some interesting stuff coming along. Could this be the sort of stuff that will take Oracle more in the direction of what Facebook have done with Hadoop… I guess we shall find out at OpenWorld 2009.

  3. Pingback: savara » Blog Archive » Software za Facebookem [“The software behind Facebook”]
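To make Chris Adkin’s point above a little more concrete, here is a rough sketch of the kind of Oracle constructs he is referring to (the table and column names are hypothetical, for illustration only):

    -- A bitmap index, suited to low-cardinality columns in BI-style filters.
    CREATE BITMAP INDEX idx_views_country ON page_views (country);

    -- A materialized view that precomputes a common aggregate,
    -- refreshed on demand rather than recomputed for every query.
    CREATE MATERIALIZED VIEW mv_daily_views
    REFRESH COMPLETE ON DEMAND
    AS SELECT view_date, COUNT(*) AS views
       FROM page_views
       GROUP BY view_date;

Constructs like these are a big part of how an RDBMS answers BI queries quickly; Hive, at the time of writing, has no direct equivalent, which is the gap the comment is pointing at.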
