Load Balancing through a Block Rearrangement Policy for Heterogeneous Hadoop Clusters

To store and analyze Big Data, Hadoop is the most common tool among researchers and scientists. Huge amounts of data are stored in Hadoop using the Hadoop Distributed File System (HDFS). HDFS uses a block placement policy to split a very large file into blocks and place them across the cluster in a distributed manner. Hadoop and HDFS were designed to work efficiently on homogeneous clusters, but in today's networked environments a cluster rarely consists of homogeneous nodes alone. Hence, a storage policy is needed that works efficiently on both homogeneous and heterogeneous clusters, so that applications can execute time-efficiently in either environment. Data locality in Hadoop schedules a process on the node holding its data block, but with Big Data workloads it is often necessary to map data blocks to processes running across multiple nodes. Hadoop handles this by copying the data block to the node where the mapper is running, which causes significant performance degradation, especially on heterogeneous clusters, due to I/O delays and network congestion. Here we present a novel algorithm that balances data blocks on specific nodes (i.e., custom block placement) by dividing the cluster's nodes into two categories, such as homogeneous vs. heterogeneous or high-performing vs. low-performing nodes. This policy achieves better load rearrangement among the nodes and lets us place data blocks exactly where we want them for processing.
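To make the two-category idea concrete, the following Java snippet is a minimal sketch (not the paper's actual implementation): it partitions nodes into high- and low-performing categories around the median of a performance score and assigns blocks with a weighted round-robin. The names (`Node`, `categorize`, `placeBlocks`), the scores, and the 0.7 share given to the high-performing category are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical node descriptor: a name plus a relative performance score
// (e.g., derived from CPU speed, disk throughput, or benchmark results).
record Node(String name, double performanceScore) {}

public class CategoryAwarePlacement {

    // Split the cluster into two categories around the median score:
    // high-performing nodes and low-performing nodes.
    static Map<String, List<Node>> categorize(List<Node> nodes) {
        List<Node> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble(Node::performanceScore).reversed());
        int cut = sorted.size() / 2;
        Map<String, List<Node>> categories = new HashMap<>();
        categories.put("high", sorted.subList(0, cut));
        categories.put("low", sorted.subList(cut, sorted.size()));
        return categories;
    }

    // Assign blocks round-robin within each category, giving the
    // high-performing category a proportionally larger share of blocks.
    static Map<String, List<Integer>> placeBlocks(int blockCount,
                                                  Map<String, List<Node>> categories,
                                                  double highShare) {
        Map<String, List<Integer>> placement = new HashMap<>();
        List<Node> high = categories.get("high");
        List<Node> low = categories.get("low");
        int highBlocks = (int) Math.round(blockCount * highShare);
        for (int b = 0; b < blockCount; b++) {
            List<Node> target = (b < highBlocks) ? high : low;
            Node node = target.get(b % target.size());
            placement.computeIfAbsent(node.name(), k -> new ArrayList<>()).add(b);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
            new Node("node1", 9.5), new Node("node2", 8.7),
            new Node("node3", 3.1), new Node("node4", 2.4));
        Map<String, List<Node>> categories = categorize(cluster);
        // Place 70% of blocks on the high-performing half (illustrative ratio).
        Map<String, List<Integer>> placement = placeBlocks(10, categories, 0.7);
        placement.forEach((node, blocks) ->
            System.out.println(node + " -> blocks " + blocks));
    }
}
```

In an actual HDFS deployment, logic of this kind would typically live in a class extending `org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy`, registered on the NameNode via the `dfs.block.replicator.classname` property; the standalone sketch above only illustrates the categorization and weighting step.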