Load balancing in MapReduce environments for data intensive applications

Distributed computation is widely used for processing large-scale jobs. The Hadoop framework, which is based on Google's MapReduce model, has become popular thanks to its processing power and ease of use. However, because Hadoop lacks load management, its performance can deteriorate, especially in a heterogeneous computing environment. This paper therefore presents a load balancing algorithm that aims to balance the load among heterogeneous nodes. The Hadoop simulator HSim is used to evaluate the performance of the algorithm. The results indicate that the performance of the cluster is significantly enhanced.
