MapReduce delay scheduling with deadline constraint

The MapReduce programming paradigm has been widely applied to solve large-scale data-intensive problems, and MapReduce scheduling has been studied extensively as a way to improve system performance. Delay scheduling is a common way to achieve high data locality and system performance. However, inappropriate delays can lower system throughput and may violate the original job priority constraints. This paper proposes a deadline-enabled delay (DLD) scheduling algorithm that optimizes job delay decisions according to real-time resource availability and resource competition, while still meeting job deadline constraints. Experimental results show that the resource availability estimation method of DLD is accurate (92%). Compared with other approaches, DLD reduces job turnaround time by 22% on average while keeping a high locality rate (88%).

Copyright © 2013 John Wiley & Sons, Ltd.
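To make the idea of a deadline-aware delay decision concrete, the following is a minimal sketch and not the DLD algorithm itself, whose details the abstract does not give. It assumes a simple skip-count form of delay scheduling with an added slack check. The names `Job`, `should_delay`, `assign`, `est_runtime`, and `max_skips` are all hypothetical, and the runtime estimate is assumed to be supplied by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    deadline: float      # absolute time by which the job must finish (assumed input)
    est_runtime: float   # estimated runtime if launched now (assumed input)
    local_nodes: set = field(default_factory=set)  # nodes holding the job's input data
    skipped: int = 0     # how many scheduling opportunities the job has passed up


def should_delay(job, node, now, max_skips):
    """Return True if the job should pass up this slot and wait for a local one."""
    if node in job.local_nodes:
        return False                      # data-local slot: launch immediately
    slack = job.deadline - now - job.est_runtime
    if slack <= 0:
        return False                      # no slack left: launch even without locality
    return job.skipped < max_skips        # otherwise wait, up to a bounded number of skips


def assign(jobs, node, now, max_skips=3):
    """Pick the first job (in priority order) that should run on the free node."""
    for job in jobs:                      # jobs are assumed to be sorted by priority
        if should_delay(job, node, now, max_skips):
            job.skipped += 1
            continue
        job.skipped = 0
        return job
    return None


if __name__ == "__main__":
    jobs = [
        Job("a", deadline=100, est_runtime=10, local_nodes={"n1"}),
        Job("b", deadline=12, est_runtime=10, local_nodes={"n1"}),
    ]
    chosen = assign(jobs, "n2", now=5)
    print(chosen.name)
```

In this sketch a job with ample slack waits for a data-local slot, while a job close to its deadline is launched on whatever slot is free. A scheduler of the kind the abstract describes would additionally base the decision on estimated resource availability and competition rather than on a fixed skip count.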
