Performance Analysis of MapReduce Implementations for High Performance Homology Search (Unrefereed Workshop Manuscript)

Homology search, as used in emerging bioinformatics problems such as metagenomics, is growing in both importance and difficulty: its application area keeps broadening while its computational cost keeps increasing, so it requires massively parallel data processing. Earlier work by some of the authors devised novel algorithms such as GHOSTX, but the enumeration and scheduling of data-processing tasks was parallelized with GHOST-MP, a privately developed, MPI-based master-worker framework. An alternative is to use now-popular big data software substrates such as MapReduce, which come with abundant associated software tool-chains, but it is not clear whether the massive resources required by metagenomic homology search would overwhelm MapReduce's known limitations. By converting the GHOST-MP master-worker data processing pipeline to MapReduce and benchmarking it on a variety of high-performance MapReduce incarnations, including Hadoop and Spark, we attempt to characterize the appropriateness of MapReduce as a generic framework for metagenomics, which embodies extremely resource-consuming requirements for both compute and data.
