Two Sides of a Coin: Optimizing the Schedule of MapReduce Jobs to Minimize Their Makespan and Improve Cluster Performance

Large-scale MapReduce clusters that routinely process petabytes of unstructured and semi-structured data represent a new entity in the changing landscape of clouds. A key challenge is to increase the utilization of these MapReduce clusters. In this work, we consider a subset of the production workload that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and on cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We offer a novel abstraction framework and a heuristic, called BalancedPools, that uses the performance properties of the MapReduce jobs in a given workload to construct an optimized job schedule. Simulations performed over a realistic workload demonstrate that makespan improvements of 15%-38% are achievable simply by processing the jobs in the right order.
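
To illustrate the observation that job order alone can change the makespan, the following is a minimal sketch under strong simplifying assumptions; it is not the BalancedPools heuristic. It assumes each job is described by a single map-stage duration and a single reduce-stage duration, that a job's map stage occupies all map resources and its reduce stage all reduce resources, and that the map stage of the next job may overlap with the reduce stage of the previous one (a two-stage pipeline). The job durations and the function names `makespan` and `johnson_order` are hypothetical. The ordering used is the classical Johnson's rule for two-stage flow shops, named here as a stand-in technique.

```python
from itertools import permutations


def makespan(jobs):
    """Makespan of jobs run in the given order on a two-stage pipeline.

    Each job is a (map_duration, reduce_duration) pair. Map stages run
    back to back; a job's reduce stage starts once its own map stage and
    the previous job's reduce stage have both finished.
    """
    map_end = 0
    reduce_end = 0
    for map_time, reduce_time in jobs:
        map_end += map_time
        reduce_end = max(reduce_end, map_end) + reduce_time
    return reduce_end


def johnson_order(jobs):
    """Order jobs by Johnson's rule for a two-stage flow shop.

    Jobs whose map stage is no longer than their reduce stage go first,
    sorted by increasing map time; the rest go last, sorted by
    decreasing reduce time.
    """
    front = sorted((j for j in jobs if j[0] <= j[1]), key=lambda j: j[0])
    back = sorted((j for j in jobs if j[0] > j[1]),
                  key=lambda j: j[1], reverse=True)
    return front + back


# Hypothetical (map, reduce) durations in arbitrary time units.
jobs = [(20, 2), (4, 30), (10, 10), (3, 25), (15, 5)]

submitted = makespan(jobs)
reordered = makespan(johnson_order(jobs))
best = min(makespan(list(p)) for p in permutations(jobs))

print("submission order:", submitted)
print("Johnson order:   ", reordered)
print("best of all orders:", best)
```

In this toy instance the same five jobs finish noticeably sooner when reordered, because jobs with short map stages start the reduce stage early and jobs with short reduce stages are left for the end. Real MapReduce jobs share slots within a stage and have many tasks per stage, so the single-duration model above is only a rough abstraction.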
