In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data-intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine---our model allows each machine to perform sequential computations in time polynomial in the size of the original input.
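The interleaving of parallel map steps and grouped reduce steps can be illustrated with a minimal single-machine simulation of one MapReduce round. This is an illustrative sketch, not the paper's model: the function names (`mapper`, `reducer`, `mapreduce_round`) and the word-count example are assumptions chosen for concreteness.

```python
from collections import defaultdict

def mapper(doc):
    # Emits (key, value) pairs independently per input record;
    # in a real deployment these calls run in parallel across machines.
    for word in doc.split():
        yield (word, 1)

def reducer(word, counts):
    # Receives all values for one key and emits aggregated output.
    yield (word, sum(counts))

def mapreduce_round(inputs, mapper, reducer):
    # Map phase: process each record, collecting emitted pairs.
    shuffle = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            shuffle[key].append(value)
    # Shuffle groups values by key; reduce phase handles each key
    # sequentially on (conceptually) a single machine.
    output = []
    for key, values in shuffle.items():
        output.extend(reducer(key, values))
    return output

counts = dict(mapreduce_round(["a b a", "b c"], mapper, reducer))
# counts == {"a": 2, "b": 2, "c": 1}
```

The sublinear-memory restriction in the model corresponds to requiring that no single `shuffle[key]` list, and no machine's share of records, grows comparably to the input size.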
We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to the Ω(log n) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems, such as undirected s-t connectivity, in the MapReduce framework.
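The two-round dense-graph MST idea can be sketched as follows: randomly partition the vertices, let each machine compute an MST of the edges it receives (by the cycle property, discarded edges cannot belong to the global MST), then compute the MST of the surviving edges on one machine. This is a simplified sketch under assumed names (`two_round_mst`, `kruskal`) and uses plain Kruskal's algorithm locally; the choice of the number of parts `k` in the paper is tuned so each machine's edge set is sublinear for dense graphs.

```python
import random

def _find(parent, x):
    # Union-find with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def kruskal(vertices, edges):
    # edges: list of (weight, u, v); returns MST (or forest) edges.
    parent = {v: v for v in vertices}
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = _find(parent, u), _find(parent, v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

def two_round_mst(vertices, edges, k, seed=0):
    rng = random.Random(seed)
    part = {v: rng.randrange(k) for v in vertices}
    # Round 1: route each edge to the machine responsible for its pair
    # of parts; each machine keeps only the MST of the edges it holds.
    # An edge dropped here is the heaviest on some cycle, so by the
    # cycle property it is not in the global MST.
    buckets = {}
    for w, u, v in edges:
        key = tuple(sorted((part[u], part[v])))
        buckets.setdefault(key, []).append((w, u, v))
    survivors = []
    for bucket in buckets.values():
        vs = {u for _, u, _ in bucket} | {v for _, _, v in bucket}
        survivors.extend(kruskal(vs, bucket))
    # Round 2: a single machine computes the MST of the survivors.
    return kruskal(vertices, survivors)
```

For a graph with distinct edge weights, every global MST edge survives round 1 (it is never the heaviest edge on any cycle), so round 2 returns the exact MST.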