An effective data locality aware task scheduling method for MapReduce framework in heterogeneous environments

Data locality has recently been exploited extensively in Cloud computing to improve system performance. However, when scheduling Map tasks in a Hadoop MapReduce framework running in a heterogeneous environment, existing methods either fail to reduce the number of non-local Map tasks (tasks whose input data is not stored on the node that executes them) or harm fairness, and thus degrade system performance. To address this problem, this paper proposes a data locality aware scheduling method that improves Hadoop MapReduce performance in heterogeneous computing environments. On receiving a task request from a node, our method preferentially schedules a task whose input data is stored on the requesting node. If no such task exists, our method selects the task whose input data is nearest to the requesting node, and then decides whether to reserve the task for the node storing the input data or to schedule it to the requesting node, transferring the input data to that node on the fly. As a proof of concept, we implement the method in Hadoop-0.20.2. To evaluate its performance, we experimentally compare our method against the default scheduling method used in Hadoop-0.20.2. The experimental results show that our method improves data locality and reduces both the normalized execution time and the response time of jobs.
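
The scheduling decision described in the abstract can be illustrated with a minimal Python sketch. It is not the Hadoop-0.20.2 implementation: the names `Task`, `pick_task`, `distance`, `wait_time_s`, and `bandwidth_mb_per_s` are hypothetical, and the reserve-or-transfer rule (reserve when the node holding the data is expected to become free sooner than the data could be transferred) is an assumption made for illustration, since the abstract does not state the actual criterion.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    input_size_mb: float
    replica_nodes: tuple  # nodes that store the task's input data (assumed non-empty)


def pick_task(requesting_node, pending, distance, wait_time_s, bandwidth_mb_per_s):
    """Return (task, action); action is 'local', 'reserve', 'transfer', or 'idle'.

    distance(a, b)   -> hypothetical network distance between two nodes
    wait_time_s(n)   -> hypothetical estimate of seconds until node n is free
    """
    if not pending:
        return None, "idle"

    # Step 1: prefer a task whose input data is stored on the requesting node.
    for task in pending:
        if requesting_node in task.replica_nodes:
            return task, "local"

    # Step 2: otherwise pick the task whose input data is nearest to the requester.
    def nearest_holder(task):
        return min(task.replica_nodes, key=lambda n: distance(requesting_node, n))

    task = min(pending, key=lambda t: distance(requesting_node, nearest_holder(t)))
    holder = nearest_holder(task)

    # Step 3 (assumed rule): reserve the task for the data-holding node if that
    # node should be free before the input could be copied to the requester.
    transfer_s = task.input_size_mb / bandwidth_mb_per_s
    if wait_time_s(holder) < transfer_s:
        return task, "reserve"
    return task, "transfer"
```

In this sketch, "reserve" means the requesting node receives no task in this round and the selected task stays queued for the node that stores its input, while "transfer" means the task runs on the requesting node and its input is copied over during execution.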
