Hadoop is the tool most commonly used by researchers and scientists to store and analyze Big Data. Hadoop stores large volumes of data in the Hadoop Distributed File System (HDFS). HDFS splits a very large file into blocks and uses a block placement policy to distribute those blocks across the cluster. Hadoop and HDFS were designed to work efficiently on a homogeneous cluster, but in practice a cluster rarely consists of homogeneous nodes only. A storage policy is therefore needed that works efficiently on both homogeneous and heterogeneous clusters, so that applications can run time-efficiently in either environment. Data locality in Hadoop schedules a process on the node that holds its data block, but with Big Data a block often has to be mapped to processes on other nodes. To handle this, Hadoop copies the data block to the node where the mapper is running. This copying degrades performance considerably, especially on heterogeneous clusters, because of I/O delay and network congestion. Here we present a novel algorithm that places data blocks only on specific nodes (i.e., custom block placement) by dividing all nodes into two categories, such as homogeneous vs. heterogeneous or high-performing vs. low-performing. This policy achieves better load rearrangement among the nodes and lets us place data blocks exactly where we want them to be processed.
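The two-category idea can be illustrated with a minimal Python sketch. It is not the algorithm proposed here and does not use any Hadoop API: the per-node performance score, the threshold, the round-robin assignment, and the names `split_nodes` and `place_blocks` are all assumptions made for illustration. The default of three replicas mirrors the usual HDFS replication factor.

```python
from itertools import cycle


def split_nodes(scores, threshold):
    """Divide nodes into two categories by a hypothetical performance score."""
    high = sorted(n for n, s in scores.items() if s >= threshold)
    low = sorted(n for n, s in scores.items() if s < threshold)
    return high, low


def place_blocks(num_blocks, target_nodes, replication=3):
    """Assign every block's replicas to nodes of the chosen category only.

    Replicas are handed out round-robin (an assumption), so the replicas of
    one block always land on distinct nodes.
    """
    if not target_nodes:
        raise ValueError("target category is empty")
    replicas = min(replication, len(target_nodes))
    ring = cycle(target_nodes)
    return {
        block: [next(ring) for _ in range(replicas)]
        for block in range(num_blocks)
    }


# Hypothetical cluster: node name -> relative performance score.
scores = {"n1": 0.9, "n2": 0.4, "n3": 0.8, "n4": 0.3, "n5": 0.7}
high, low = split_nodes(scores, threshold=0.5)

# Place four blocks, two replicas each, on high-performing nodes only.
placement = place_blocks(4, high, replication=2)
for block, nodes in placement.items():
    print(block, nodes)
```

Restricting `target_nodes` to one category is what keeps blocks off the other category entirely; a real policy would also have to account for rack awareness and remaining disk capacity, which this sketch ignores.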