Improving MapReduce Performance by Data Prefetching in Heterogeneous or Shared Environments

MapReduce is an effective programming model for large-scale data-intensive computing applications. Hadoop, an open-source implementation of MapReduce, has been widely used. The communication overhead from the big data sets’ transmission affects the performance of Hadoop greatly. In consideration of data locality, Hadoop schedules tasks to the nodes near the data locations preferentially to decrease data transmission overhead, which works well in homogeneous and dedicated MapReduce environments. However, due to practical considerations about cost and resource utilization, it is common to maintain heterogeneous clusters or share resources by multiple users. Unfortunately, it’s difficult to take advantage of data locality in these heterogeneous or shared environments. To improve the performance of MapReduce in heterogeneous or shared environments, a data prefetching mechanism is proposed in this paper, which can fetch the data to corresponding compute nodes in advance. It is proved that the proposal of this paper reduces data transmission overhead effectively with theoretical analysis. The mechanism is implemented and evaluated on Hadoop-1.0.4. Experiment results on real applications show that the data prefetching mechanism can reduce data transmission time by up to 94%.

[1]  Cristina L. Abad,et al.  DARE: Adaptive Data Replication for Efficient Cluster Scheduling , 2011, 2011 IEEE International Conference on Cluster Computing.

[2]  Sanjay Ghemawat,et al.  MapReduce: a flexible data processing tool , 2010, CACM.

[3]  Jimeng Sun,et al.  DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[4]  Shengzhong Feng,et al.  Improving Data Locality of MapReduce by Scheduling in Homogeneous Computing Environments , 2011, 2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications.

[5]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[6]  Xian-He Sun,et al.  ADAPT: Availability-Aware MapReduce Data Placement for Non-dedicated Distributed Computing , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.

[7]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[8]  Xuegong Zhang,et al.  Using class-center vectors to build support vector machines , 1999, Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468).

[9]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[10]  Yun Tian,et al.  Improving MapReduce performance through data placement in heterogeneous Hadoop clusters , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[11]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[12]  Sameh A. Salem,et al.  Mapreduce Performance in Heterogeneous Environments : A Review , 2013 .

[13]  Ronald C. Taylor An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics , 2010, BMC Bioinformatics.

[14]  Publisher Bioinfo Publications Journal of Grid and Distributed Computing , 2013 .

[15]  Kenli Li,et al.  MTSD: A Task Scheduling Algorithm for MapReduce Base on Deadline Constraints , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[16]  Jin-Soo Kim,et al.  HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.