Joint scheduling of processing and Shuffle phases in MapReduce systems

MapReduce has emerged as an important paradigm for processing data in large data centers. MapReduce is a three-phase algorithm comprising the Map, Shuffle, and Reduce phases. Due to its widespread deployment, several recent papers have outlined practical schemes to improve the performance of MapReduce systems. All of these efforts focus on one of the three phases to obtain performance improvements. In this paper, we consider the problem of jointly scheduling all three phases of the MapReduce process, with a view to understanding the theoretical complexity of joint scheduling and working towards practical heuristics for scheduling the tasks. We give approximation algorithms with performance guarantees and outline several heuristics for solving the joint scheduling problem.
