A Modified Key Partitioning for BigData Using MapReduce in Hadoop

In the era of BigData, massive amounts of structured and unstructured data are created every day by a multitude of ubiquitous sources. BigData is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that process BigData. For MapReduce to work, it has to divide the workload among the computers in the network. As a result, the performance of MapReduce depends strongly on how evenly it distributes this workload. This can be a challenge, particularly in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the partitioning mechanism analyzes the samples. This study proposes an improved partitioning algorithm using modified key partitioning that improves load balancing and memory utilization. This is achieved through an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient and more accurate than the existing implementation.
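To make the dependence on sampling concrete, the following is a minimal Python sketch of generic sample-based range partitioning, in the spirit of the approach TeraSort is described as using. It is not the modified key partitioning algorithm proposed in this study, whose details are not given in the abstract. The function names (choose_split_points, range_partition, load_per_reducer) are hypothetical and chosen only for illustration.

```python
import bisect


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 split keys at evenly spaced ranks of the sorted sample.

    Assumption: the sample is small enough to sort in memory.
    """
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    # Keys below the first split point go to reducer 0; keys equal to or
    # above the last split point go to the last reducer.
    return bisect.bisect_right(split_points, key)


def load_per_reducer(keys, split_points):
    """Count how many keys each reducer would receive."""
    loads = [0] * (len(split_points) + 1)
    for key in keys:
        loads[range_partition(key, split_points)] += 1
    return loads


if __name__ == "__main__":
    # Illustrative, skewed key set: most keys fall in a narrow range.
    keys = [k % 10 for k in range(800)] + list(range(10, 210))
    sample = keys[::7]  # hypothetical sampling rule: every 7th key
    splits = choose_split_points(sample, 4)
    print("split points:", splits)
    print("load per reducer:", load_per_reducer(keys, splits))
```

If the sample is too small or unrepresentative, the split points land in the wrong places and some reducers receive far more keys than others, which is the load imbalance the abstract refers to.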
