Improving Data Locality of MapReduce by Scheduling in Homogeneous Computing Environments

Data Locality is one of the critical factors to affect performance. This paper proposes a next-k-node scheduling (NKS) method to improve the data locality of map tasks. The method first calculates the probabilities of each map task, and then preferentially schedules the one with the highest probability. It generates low probabilities for the tasks which satisfy node locality with the nodes to issue requests, so it can reserve these tasks to these nodes. We have implemented the NKS method in hadoop-0.20.2. The experiment results have shown that the NKS method reduced 78% of the map tasks processed without node locality, reduced 77%of the network load caused by the tasks, and improved the performance of Hadoop MapReduce when comparing with the default task scheduling method in Hadoop. Obviously, the NKS method is very suitable for the homogeneous environment with network overload.

[1]  Gianluigi Zanetti,et al.  Biodoop: Bioinformatics on Hadoop , 2009, 2009 International Conference on Parallel Processing Workshops.

[2]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[3]  Kai Wang,et al.  Spatial Queries Evaluation with MapReduce , 2009, 2009 Eighth International Conference on Grid and Cooperative Computing.

[4]  Jin-Soo Kim,et al.  HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[5]  Quan Chen,et al.  SAMR: A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment , 2010, 2010 10th IEEE International Conference on Computer and Information Technology.

[6]  Thomas Sandholm,et al.  Dynamic Proportional Share Scheduling in Hadoop , 2010, JSSPP.

[7]  Surendra Byna,et al.  A Taxonomy of Data Prefetching Mechanisms , 2008, 2008 International Symposium on Parallel Architectures, Algorithms, and Networks (i-span 2008).

[8]  Insup Lee,et al.  Real-Time MapReduce Scheduling , 2010 .

[9]  Stefan Wrobel,et al.  Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[10]  Matei Zaharia,et al.  Job Scheduling for Multi-User MapReduce Clusters , 2009 .