H2Hadoop: Improving Hadoop Performance Using the Metadata of Related Jobs

Cloud Computing leverages the Hadoop framework for processing BigData in parallel. Native Hadoop has certain limitations that, if addressed, would let jobs execute more efficiently. These limitations stem mostly from data locality in the cluster, job and task scheduling, and resource allocation in Hadoop. Efficient resource allocation remains a challenge in Cloud Computing MapReduce platforms. We propose H2Hadoop, an enhanced Hadoop architecture that reduces the computation cost associated with BigData analysis. The proposed architecture also addresses the issue of resource allocation in native Hadoop. H2Hadoop provides a better solution for “text data”, such as finding a DNA sequence or the motif of a DNA sequence. H2Hadoop also provides an efficient Data Mining approach for Cloud Computing environments. The H2Hadoop architecture leverages the NameNode's ability to assign jobs to the TaskTrackers (DataNodes) within the cluster. By adding control features to the NameNode, H2Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data without sending the job to the whole cluster. Compared with native Hadoop, H2Hadoop reduces CPU time, the number of read operations, and other Hadoop performance factors.
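The routing idea described above, sending a job only to the DataNodes known to hold the required data, can be illustrated with a minimal sketch. This is not the authors' implementation and does not use any Hadoop API: the `JobMetadataTable` class, the `select_datanodes` function, the node names, and the use of the searched sequence as the lookup key are all assumptions made for illustration. The sketch assumes that metadata from a related earlier job records which nodes held matching data, and that a job with no recorded metadata falls back to the whole cluster, as in native Hadoop.

```python
from typing import Dict, Iterable, List, Optional, Set


class JobMetadataTable:
    """Hypothetical table of metadata from earlier jobs (not a Hadoop class).

    Maps a job key, e.g. a searched DNA sequence, to the set of DataNodes
    that held matching data when a related job ran.
    """

    def __init__(self) -> None:
        self._nodes_by_key: Dict[str, Set[str]] = {}

    def record(self, key: str, nodes_with_matches: Iterable[str]) -> None:
        # Store which nodes contained the data for this key.
        self._nodes_by_key[key] = set(nodes_with_matches)

    def lookup(self, key: str) -> Optional[Set[str]]:
        # Returns None when no related job has been recorded.
        return self._nodes_by_key.get(key)


def select_datanodes(
    key: str, all_nodes: List[str], table: JobMetadataTable
) -> List[str]:
    """Pick the nodes a new job should be sent to.

    With recorded metadata, only the nodes known to hold the data are
    returned; otherwise the job goes to every node in the cluster.
    """
    known = table.lookup(key)
    if known is None:
        return list(all_nodes)
    # Preserve cluster order and ignore nodes that are no longer listed.
    return [node for node in all_nodes if node in known]


if __name__ == "__main__":
    cluster = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names
    table = JobMetadataTable()

    # First job for this sequence: nothing recorded, so use the whole cluster.
    print(select_datanodes("GGGATTTA", cluster, table))

    # Assume the first job found the sequence only on dn2 and dn4.
    table.record("GGGATTTA", ["dn2", "dn4"])

    # A related later job is routed to those two nodes only.
    print(select_datanodes("GGGATTTA", cluster, table))
```

In this sketch, the saving comes from the second call: fewer nodes receive tasks, which is the kind of effect that would lower read operations and CPU time, although the actual gains depend on how the data is distributed across the cluster.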
