Ripple: Improved Architecture and Programming Model for Bulk Synchronous Parallel Style of Analytics

We present Ripple, an architecture and programming model for a broad class of data analytics. Ripple builds on the idea of iterated MapReduce and adds two innovations. First, it offers a richer programming model that incorporates more ideas from the Bulk Synchronous Parallel (BSP) model of computation, among others, yielding a flexible, higher-level platform that is easier to use for both application programmers and platform implementors. Second, Ripple is built on a narrow interface for key/value storage, which makes it portable across many different key/value store implementations. Together, these two ideas improve the scope, performance, and openness of the data analytics platform. We evaluate Ripple on three representative, non-trivial data analysis scenarios that require iterative computation, and we show that it achieves clear performance advantages over iterated MapReduce.
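
To make the two design points concrete, the following is a minimal, hypothetical sketch of how a narrow key/value interface can be paired with a BSP-style superstep driver. It is not Ripple's actual API: the names `KVStore`, `InMemoryStore`, and `run_bsp`, and their signatures, are illustrative assumptions only, and the single-process barrier is implicit.

```python
# Hypothetical sketch only: illustrates (1) a narrow key/value contract that
# different store backends could implement and (2) a BSP-style driver that
# runs supersteps with a barrier between rounds. Names and signatures are
# assumptions for illustration, not Ripple's actual interface.
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Tuple


class KVStore(ABC):
    """Minimal key/value contract; any store offering these three
    operations could, in principle, back the runtime."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def scan(self, prefix: str) -> Iterable[Tuple[str, bytes]]: ...


class InMemoryStore(KVStore):
    """Toy backend used here only so the example runs end to end."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def scan(self, prefix: str) -> Iterable[Tuple[str, bytes]]:
        return [(k, v) for k, v in sorted(self._data.items())
                if k.startswith(prefix)]


def run_bsp(store: KVStore, superstep, max_rounds: int) -> None:
    """Run BSP-style rounds: each superstep reads and writes the store;
    iteration stops when a superstep reports no further updates."""
    for round_no in range(max_rounds):
        changed = superstep(store, round_no)
        # A real system would place a global barrier across workers here;
        # in this single-process sketch the barrier is implicit.
        if not changed:
            break


if __name__ == "__main__":
    # Tiny demo of iterative computation: double a counter until it
    # exceeds 100, one superstep per doubling.
    store = InMemoryStore()
    store.put("counter/x", b"1")

    def double_until_100(kv: KVStore, _round: int) -> bool:
        value = int(kv.get("counter/x"))
        if value > 100:
            return False
        kv.put("counter/x", str(value * 2).encode())
        return True

    run_bsp(store, double_until_100, max_rounds=20)
    print(store.get("counter/x"))  # b'128'
```

Because the driver only assumes the three-method contract, swapping `InMemoryStore` for any other backend implementing `get`/`put`/`scan` would leave the superstep code unchanged, which is the portability argument the abstract makes.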
