DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications With Interest Locality

Recent years have seen an increasing number of scientists employ data parallel computing frameworks such as MapReduce and Hadoop to run data intensive applications and conduct analysis. In these co-located compute and storage frameworks, a wise data placement scheme can significantly improve performance. Existing data parallel frameworks, e.g., Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance. However, we observe that many data intensive applications exhibit interest locality: they sweep only part of a big data set. Data that are often accessed together are related by their grouping semantics. Without taking data grouping into consideration, random placement does not perform well and falls far below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address the above-mentioned problem. DRAW dynamically scrutinizes data accesses recorded in system log files. It extracts optimal data groupings and re-organizes data layouts to achieve the maximum parallelism per group subject to load balance. By experimenting with two real-world MapReduce applications under different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8%, reduces the completion latency of the map phase by up to 41.7%, and improves the overall performance by 36.4%, in comparison with Hadoop's default random placement.
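The idea of grouping co-accessed data and then spreading each group across nodes can be illustrated with a minimal Python sketch. It is not the DRAW algorithm itself, whose details the abstract does not give. It assumes that each log entry is simply the list of block IDs one task read, that blocks co-accessed at least `min_count` times belong to one group, and that a greedy least-loaded assignment is an acceptable stand-in for the paper's layout optimization. The names `co_access_counts`, `group_blocks`, `place`, and `min_count` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_access_counts(task_logs):
    """Count how often each pair of blocks is read by the same task."""
    counts = defaultdict(int)
    for blocks in task_logs:
        for a, b in combinations(sorted(set(blocks)), 2):
            counts[(a, b)] += 1
    return counts


def group_blocks(task_logs, min_count=1):
    """Merge blocks co-accessed at least min_count times into groups.

    Assumption: groups are connected components of the co-access graph,
    computed with a small union-find structure.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for blocks in task_logs:
        for b in blocks:
            find(b)  # register every block, even if never co-accessed
    for (a, b), n in co_access_counts(task_logs).items():
        if n >= min_count:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for b in list(parent):
        groups[find(b)].append(b)
    return [sorted(g) for g in groups.values()]


def place(groups, num_nodes):
    """Spread each group's blocks over distinct nodes, least loaded first."""
    load = [0] * num_nodes
    placement = {}
    for group in sorted(groups, key=len, reverse=True):
        # Visit nodes from least to most loaded so that a group is spread
        # widely (parallelism) while the totals stay close (load balance).
        order = sorted(range(num_nodes), key=lambda n: (load[n], n))
        for i, block in enumerate(group):
            node = order[i % num_nodes]
            placement[block] = node
            load[node] += 1
    return placement


if __name__ == "__main__":
    logs = [["a1", "a2", "a3"], ["a1", "a2"], ["b1", "b2"], ["b2", "b3"]]
    groups = group_blocks(logs)
    print(groups)
    print(place(groups, num_nodes=3))
```

Under random placement, two blocks of the same group may land on the same node, so a job that reads only that group can use fewer nodes in parallel. The sketch avoids this by giving each block of a group its own node whenever the group is no larger than the cluster.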
