Parallel MapReduce: Maximizing Cloud Resource Utilization and Performance Improvement Using Parallel Execution Strategies

MapReduce is the preferred cloud computing framework for large-scale data analysis and application processing. Existing MapReduce frameworks suffer performance degradation because they adopt sequential processing approaches with little modification, and they therefore underutilize cloud resources. To overcome this drawback and reduce costs, we introduce a Parallel MapReduce (PMR) framework in this paper. We design a novel parallel execution strategy for Map and Reduce worker nodes. The strategy improves performance and uses cloud resources more efficiently by executing Map and Reduce functions in parallel, exploiting the multicore environments available on computing nodes. We explain the makespan modeling and working principle of the PMR framework in detail. The performance of PMR is compared with that of Hadoop through experiments on three biomedical applications. For BLAST, CAP3, and DeepBind, the experiments report makespan reductions of 38.92%, 18.00%, and 34.62%, respectively, for PMR relative to Hadoop. The experimental results indicate that the proposed PMR cloud computing platform is robust, cost-effective, and scalable, and that it supports diverse applications on public and private cloud platforms. The results also show good agreement between the theoretical makespan model and the experimentally measured values.
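The link between multicore parallel execution and makespan reduction can be illustrated with a minimal sketch. It is not the paper's makespan model: the greedy earliest-free-core scheduler, the function names `makespan` and `reduction_percent`, and the task durations are all assumptions chosen for illustration, and the reduction is computed with the conventional definition (baseline minus improved, divided by baseline).

```python
import heapq


def makespan(task_durations, cores):
    """Return the finish time of the last task under greedy scheduling.

    Assumption: each task goes to whichever core becomes free first.
    This is a generic list-scheduling sketch, not the PMR model.
    """
    finish_times = [0.0] * cores
    heapq.heapify(finish_times)
    for duration in task_durations:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + duration)
    return max(finish_times)


def reduction_percent(baseline, improved):
    """Percentage by which `improved` is shorter than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical task durations in seconds (not taken from the paper).
tasks = [4.0, 3.0, 3.0, 2.0, 2.0, 2.0]

sequential = makespan(tasks, cores=1)  # one task at a time
parallel = makespan(tasks, cores=4)    # tasks spread over four cores

print(f"sequential makespan: {sequential:.2f} s")
print(f"parallel makespan:   {parallel:.2f} s")
print(f"reduction:           {reduction_percent(sequential, parallel):.2f} %")
```

In this toy setting the makespan is bounded below by the longest single task, so adding cores beyond a certain point gives no further reduction.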
