Hadoop Performance Acceleration by Effective Data and Job Placement

Accelerating Hadoop performance requires efficient handling of data and job placement. Specifically, we focus on accelerating heterogeneous distributed clusters, because Hadoop's default configuration delivers limited performance for data-intensive jobs. Improving Hadoop performance requires accounting for the heterogeneity of nodes, reducing job latency, and improving the data locality of blocks. In this research, we use a block rearrangement policy for data placement, which rearranges data blocks according to each node's processing capability (that is, the heterogeneity of the nodes), and we use node labeling and scheduling schemes for job placement. The experimental results show that the proposed model accelerates Hadoop performance, achieving higher data locality and shorter job completion time than the default configuration and policy.
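To make the idea of capability-aware block placement concrete, the following is a minimal sketch, not the policy evaluated in this work. It assumes each node's processing capability can be summarized as a single positive weight and splits a file's blocks in proportion to those weights, using largest-remainder rounding so the counts add up exactly. The function name `allocate_blocks`, the node names, and the weights are all hypothetical.

```python
from fractions import Fraction
from typing import Dict


def allocate_blocks(total_blocks: int, capability: Dict[str, float]) -> Dict[str, int]:
    """Split total_blocks across nodes in proportion to their capability weights.

    Illustrative assumption: capability is one positive number per node
    (for example, a relative throughput score). This is not a Hadoop API.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not capability or any(w <= 0 for w in capability.values()):
        raise ValueError("capability must contain only positive weights")

    # Exact rational arithmetic avoids floating-point rounding surprises.
    weights = {node: Fraction(w) for node, w in capability.items()}
    total_weight = sum(weights.values())
    exact = {node: total_blocks * w / total_weight for node, w in weights.items()}

    # Give every node the integer part of its proportional share.
    alloc = {node: int(share) for node, share in exact.items()}

    # Hand the leftover blocks to the nodes with the largest fractional parts,
    # breaking ties in favor of the more capable node.
    leftover = total_blocks - sum(alloc.values())
    by_remainder = sorted(
        weights,
        key=lambda node: (exact[node] - alloc[node], weights[node]),
        reverse=True,
    )
    for node in by_remainder[:leftover]:
        alloc[node] += 1
    return alloc


# Hypothetical three-node heterogeneous cluster and a 70-block file.
print(allocate_blocks(70, {"fast": 4.0, "medium": 2.0, "slow": 1.0}))
# → {'fast': 40, 'medium': 20, 'slow': 10}
```

With equal weights the sketch reduces to an even split, which is roughly what a capability-unaware placement would produce. A real policy would also have to respect replication and rack constraints, which this sketch ignores.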
