Open data challenges at Facebook

At Facebook, our data systems process huge volumes of data, ranging from hundreds of terabytes in memory to hundreds of petabytes on disk. We categorize our systems as “small data” or “big data” based on the type of queries they run. Small data refers to OLTP-like queries that process and retrieve a small amount of data, for example, the thousands of objects necessary to render Facebook's personalized News Feed for each person. These objects are requested by their IDs; indexes limit the amount of data accessed during a single query, regardless of the total volume of data. Big data refers to queries that process large amounts of data, usually for analysis: troubleshooting, identifying trends, and making decisions. Big data stores are the workhorses for data analysis at Facebook. They grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. In this tutorial, we will describe our data systems and the current challenges we face. We will lead a discussion on these challenges, approaches to solve them, and potential pitfalls. We hope to stimulate interest in solving these problems in the research community.
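
The contrast drawn above is between indexed point lookups that touch only the few objects a request needs and analysis queries that must scan the full data set. The following is a minimal, self-contained Python sketch of that contrast only; the object layout, the per-owner index, and the function names are purely illustrative assumptions and are not taken from any Facebook system or API.

```python
# Illustrative sketch (hypothetical data and names): an indexed point lookup
# that touches a handful of objects ("small data") versus a full scan that
# aggregates over every event ("big data").
from collections import defaultdict

# Hypothetical object store: objects keyed by id, plus a per-owner index.
objects = {
    1: {"id": 1, "owner": "alice", "type": "post", "likes": 12},
    2: {"id": 2, "owner": "bob", "type": "photo", "likes": 7},
    3: {"id": 3, "owner": "alice", "type": "comment", "likes": 3},
}
index_by_owner = defaultdict(list)
for obj in objects.values():
    index_by_owner[obj["owner"]].append(obj["id"])

def render_feed(owner):
    """"Small data" access: fetch only the objects the index points to."""
    return [objects[i] for i in index_by_owner[owner]]

def total_likes_by_type(rows):
    """"Big data" access: scan every row and aggregate, regardless of indexes."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["type"]] += row["likes"]
    return dict(totals)

if __name__ == "__main__":
    print(render_feed("alice"))                    # touches 2 objects via the index
    print(total_likes_by_type(objects.values()))   # touches every object
```

The same shape holds at scale: the first access pattern stays cheap because the index bounds the work per query, while the second grows with the total volume of data ingested.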
