Novel Data-Distribution Technique for Hadoop in Heterogeneous Cloud Environments

The Hadoop framework was developed to process data-intensive MapReduce applications efficiently. Hadoop users express an application's computation logic as a map function and a reduce function; programs written in this style are termed MapReduce applications. The Hadoop Distributed File System (HDFS) stores MapReduce application data on the cluster nodes, called Data nodes, while the Name node acts as the control point for all Data nodes. Although this design increases resilience, the current data-distribution methodologies are not necessarily efficient for heterogeneous distributed environments such as public clouds. This work contends that existing data-distribution techniques are not necessarily suitable, since Hadoop's performance typically degrades in heterogeneous environments whenever data distribution does not reflect the computing capability of the nodes. Data locality and its impact on Hadoop performance are key factors, since they affect the Map phase when tasks are scheduled. Task-scheduling techniques in Hadoop should therefore consider data locality to enhance performance. Various task-scheduling techniques have been analysed to understand their data-locality awareness when scheduling applications. Other system factors also play a major role in achieving high performance in Hadoop data processing. The main contribution of this work is a novel methodology that places data on Hadoop Data nodes according to their computing ratio. Two standard MapReduce applications, Word Count and Grep, have been executed, and a significant performance improvement has been observed with the proposed data-distribution technique.
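To illustrate the idea of ratio-based placement described above, the following is a minimal sketch, not the paper's implementation: it assumes each Data node has already been assigned a computing ratio (for example, derived from the measured execution time of a benchmark run on that node) and splits the input blocks across nodes in proportion to those ratios, so that faster nodes hold more local data. The node names, ratio values, and the rounding policy are illustrative assumptions only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of computing-ratio-based block placement for a heterogeneous cluster:
 * each Data node receives a share of the input blocks proportional to its
 * (assumed, pre-measured) computing ratio.
 */
public class RatioBasedPlacement {

    /** Distribute totalBlocks across nodes in proportion to their computing ratios. */
    static Map<String, Integer> distribute(Map<String, Double> computingRatios, int totalBlocks) {
        double ratioSum = computingRatios.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Integer> allocation = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Double> node : computingRatios.entrySet()) {
            int blocks = (int) Math.floor(totalBlocks * node.getValue() / ratioSum);
            allocation.put(node.getKey(), blocks);
            assigned += blocks;
        }
        // Assign any blocks left over by rounding to the fastest node (an arbitrary policy choice).
        String fastest = computingRatios.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
        allocation.merge(fastest, totalBlocks - assigned, Integer::sum);
        return allocation;
    }

    public static void main(String[] args) {
        // Hypothetical ratios, e.g. obtained from per-node benchmark execution times.
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("datanode-1", 3.0);  // fastest node
        ratios.put("datanode-2", 2.0);
        ratios.put("datanode-3", 1.0);  // slowest node
        System.out.println(distribute(ratios, 120)); // {datanode-1=60, datanode-2=40, datanode-3=20}
    }
}
```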
