A comparative review of job scheduling for MapReduce

MapReduce is an emerging paradigm for data-intensive processing, supported by cloud computing technology. It provides convenient programming interfaces for distributing data-intensive work across a cluster. Its strengths are fault tolerance, a simple programming structure and high scalability. A variety of applications have adopted MapReduce, including scientific analysis, web data processing and high-performance computing. Data-intensive computing systems such as Hadoop and Dryad should provide an efficient scheduling mechanism to improve utilization in a shared cluster environment. The problems of scheduling MapReduce jobs are mostly caused by locality and synchronization overhead. In addition, multiple jobs must be scheduled in a shared cluster under fairness constraints. After introducing the scheduling problems with regard to locality, synchronization and fairness constraints, this paper reviews a collection of scheduling methods that address these issues in MapReduce, and compares the methods by evaluating their features, strengths and weaknesses. For resolving synchronization overhead, two categories of studies are discussed: asynchronous processing and speculative execution. For fairness constraints with locality improvement, delay scheduling in Hadoop and the Quincy scheduler in Dryad are discussed.
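
To make the interplay between locality and fairness concrete, the following is a minimal sketch of the idea behind delay scheduling: a job that cannot run a data-local task on the free node is skipped for a bounded number of scheduling opportunities before it is allowed to run a non-local task. It is a simplification and not Hadoop's implementation. The names `Job`, `assign_task` and `max_skips` are hypothetical, and "fewest running tasks first" is an assumed stand-in for a real fair-share ordering.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    # Each pending task is modelled as the set of nodes that hold its input block.
    pending: list[set[str]]
    running: int = 0      # tasks launched so far (crude fairness key, an assumption)
    skip_count: int = 0   # times this job was passed over while waiting for locality


def assign_task(jobs, free_node, max_skips):
    """Return (job_name, task, is_local) for free_node, or None if nothing launches."""
    # Fairness stand-in: offer the slot to the job with the fewest running tasks first.
    for job in sorted(jobs, key=lambda j: j.running):
        if not job.pending:
            continue
        # Prefer a task whose input data lives on the free node.
        local = next((t for t in job.pending if free_node in t), None)
        if local is not None:
            job.pending.remove(local)
            job.running += 1
            job.skip_count = 0  # a local launch resets the wait
            return job.name, local, True
        if job.skip_count >= max_skips:
            # The job has waited long enough: give up locality for this task.
            task = job.pending.pop(0)
            job.running += 1
            return job.name, task, False
        # Otherwise skip this job and let a later job use the slot.
        job.skip_count += 1
    return None


if __name__ == "__main__":
    jobs = [Job("A", [{"n1"}]), Job("B", [{"n2"}])]
    for node in ["n2", "n3", "n3"]:
        print(node, assign_task(jobs, node, max_skips=2))
```

In the small run above, job A is skipped while its data is not on the free node, job B launches locally, and A only launches a non-local task once it has been skipped `max_skips` times. A larger `max_skips` trades waiting time for better locality.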
