Ripple: Improved Architecture and Programming Model for Bulk Synchronous Parallel Style of Analytics

We present Ripple, an architecture and programming model for a broad class of data analytics. Ripple builds on the idea of iterated MapReduce and adds two innovations. First, it offers a richer programming model that incorporates more ideas from the Bulk Synchronous Parallel (BSP) model of computation, among others, yielding a flexible, higher-level platform that is easier to use for both application programmers and platform implementors. Second, Ripple is built on a narrow interface for key/value storage, which makes it portable across many different key/value store implementations. Together, these two ideas improve the scope, performance, and openness of the data analytics platform. We evaluate Ripple on three representative, non-trivial data analysis scenarios that require iterative computation, and we show that it achieves clear performance advantages over iterated MapReduce.
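
To make the two design points concrete, the following is a minimal, hypothetical sketch of how a narrow key/value interface can be paired with a BSP-style superstep driver. It is not Ripple's actual API: the names `KVStore`, `InMemoryStore`, and `run_bsp`, and their signatures, are illustrative assumptions only, and the single-process barrier is implicit.

```python
# Hypothetical sketch only: illustrates (1) a narrow key/value contract that
# different store backends could implement and (2) a BSP-style driver that
# runs supersteps with a barrier between rounds. Names and signatures are
# assumptions for illustration, not Ripple's actual interface.
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Tuple


class KVStore(ABC):
    """Minimal key/value contract; any store offering these three
    operations could, in principle, back the runtime."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def scan(self, prefix: str) -> Iterable[Tuple[str, bytes]]: ...


class InMemoryStore(KVStore):
    """Toy backend used here only so the example runs end to end."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def scan(self, prefix: str) -> Iterable[Tuple[str, bytes]]:
        return [(k, v) for k, v in sorted(self._data.items())
                if k.startswith(prefix)]


def run_bsp(store: KVStore, superstep, max_rounds: int) -> None:
    """Run BSP-style rounds: each superstep reads and writes the store;
    iteration stops when a superstep reports no further updates."""
    for round_no in range(max_rounds):
        changed = superstep(store, round_no)
        # A real system would place a global barrier across workers here;
        # in this single-process sketch the barrier is implicit.
        if not changed:
            break


if __name__ == "__main__":
    # Tiny demo of iterative computation: double a counter until it
    # exceeds 100, one superstep per doubling.
    store = InMemoryStore()
    store.put("counter/x", b"1")

    def double_until_100(kv: KVStore, _round: int) -> bool:
        value = int(kv.get("counter/x"))
        if value > 100:
            return False
        kv.put("counter/x", str(value * 2).encode())
        return True

    run_bsp(store, double_until_100, max_rounds=20)
    print(store.get("counter/x"))  # b'128'
```

Because the driver only assumes the three-method contract, swapping `InMemoryStore` for any other backend implementing `get`/`put`/`scan` would leave the superstep code unchanged, which is the portability argument the abstract makes.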
