Serving large-scale batch computed data with Project Voldemort

Current serving systems lack the ability to bulk load massive immutable data sets without affecting serving performance. This degradation stems largely from index creation and modification, which compete with request serving for CPU and memory. We have extended Project Voldemort, a general-purpose distributed storage and serving system inspired by Amazon's Dynamo, to support bulk loading terabytes of read-only data. The extension constructs the index offline, leveraging the fault tolerance and parallelism of Hadoop. Compared to MySQL, our compact storage format and data deployment pipeline scale to twice the request throughput while maintaining sub-5 ms median latency. At LinkedIn, the largest professional social network, this system has been running in production for more than two years and serves many of the data-intensive social features on the site.
