Coded MapReduce

MapReduce is a widely used framework for executing data-intensive tasks on distributed server clusters. We present “Coded MapReduce”, a new framework that enables and exploits a particular form of coding to significantly reduce the inter-server communication load of MapReduce. In particular, Coded MapReduce exploits the repeated mapping of each data block at several servers to create coded multicasting opportunities in the shuffling phase, cutting the total communication load by a multiplicative factor that grows linearly with the number of servers in the cluster. We also analyze the tradeoff between the “computation load” and the “communication load” of Coded MapReduce.
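To make the idea of coded multicasting in the shuffling phase concrete, the following is a minimal toy sketch, not the scheme as specified in the paper. It assumes three servers, six files, each file mapped at two servers, and one reduce function per server. The names `map_value`, `placement`, and `coded_messages` are hypothetical and chosen only for illustration, and intermediate values are stand-in integers. Each sender XORs two intermediate values, each needed by one receiver and locally computable by the other, so a single multicast serves two servers at once.

```python
def map_value(q, n):
    """Hypothetical intermediate value of reduce function q on file n."""
    return (31 * q + 7 * n) % 256

# Assumed placement: each of the 6 files is mapped at 2 of the 3 servers.
placement = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {5, 6, 1, 2}}
files = range(1, 7)

# Uncoded shuffle: server k (which reduces function k) must receive
# v(k, n) for every file n that it did not map itself.
uncoded_load = sum(1 for k in placement for n in files if n not in placement[k])

# Coded shuffle: (sender, (q_a, n_a), (q_b, n_b)) means the sender multicasts
# v(q_a, n_a) XOR v(q_b, n_b) to servers q_a and q_b.
coded_messages = [
    (1, (2, 1), (3, 3)),
    (2, (3, 4), (1, 5)),
    (3, (1, 6), (2, 2)),
]

received = {k: {} for k in placement}
for sender, (qa, na), (qb, nb) in coded_messages:
    # The sender mapped both files, so it can form the coded packet.
    assert na in placement[sender] and nb in placement[sender]
    packet = map_value(qa, na) ^ map_value(qb, nb)
    # Server q_a mapped file n_b, so it cancels v(q_b, n_b) and keeps v(q_a, n_a).
    assert nb in placement[qa]
    received[qa][na] = packet ^ map_value(qb, nb)
    # Server q_b mapped file n_a, so it cancels v(q_a, n_a) and keeps v(q_b, n_b).
    assert na in placement[qb]
    received[qb][nb] = packet ^ map_value(qa, na)

coded_load = len(coded_messages)
print(uncoded_load, coded_load)
```

In this toy setting the coded shuffle uses half as many transmissions as the uncoded one, at the cost of mapping every file twice. The sketch only illustrates the mechanism for one small configuration; it does not by itself demonstrate how the gain scales with the number of servers.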
