The Map-Reduce programming paradigm and its main open-source implementation, Hadoop, have had an enormous impact on large-scale data processing. Our goal in this expository writeup is two-fold: first, we want to present complexity measures that allow us to reason formally about Map-Reduce algorithms, and second, we want to point out how this model differs from other models of parallel programming, most notably the PRAM (Parallel Random Access Machine) model. We are looking for complexity measures that are detailed enough to draw fine-grained distinctions between different algorithms, yet abstract away many of the implementation details.
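To fix ideas, here is a minimal single-machine sketch of the paradigm on the classic word-count task. The names map_fn, reduce_fn, and run_mapreduce are illustrative only; the small driver below merely simulates the shuffle-and-group step that a framework such as Hadoop would perform across many machines.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    # Map phase: emit (word, 1) for every word in the input record.
    for word in text.split():
        yield (word, 1)


def reduce_fn(word, counts):
    # Reduce phase: sum all partial counts for a given word.
    yield (word, sum(counts))


def run_mapreduce(records, map_fn, reduce_fn):
    # Stand-in for the framework: apply the mapper, then group
    # intermediate values by key (the "shuffle"), then reduce per key.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = []
    for k, vs in groups.items():
        output.extend(reduce_fn(k, vs))
    return output


if __name__ == "__main__":
    docs = [(1, "the quick brown fox"), (2, "the lazy dog")]
    print(sorted(run_mapreduce(docs, map_fn, reduce_fn)))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

The complexity measures discussed here are stated in terms of exactly these two user-supplied functions: what each mapper and reducer may see, how much it may store, and how many rounds of map-shuffle-reduce are needed.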