Performance evaluation of parallel strategies in public clouds: A study with phylogenomic workflows

Data analysis is an exploratory process that demands high-performance computing (HPC). SciPhylomics, for example, is a data-intensive workflow that produces phylogenomic trees from an input set of protein sequences of genomes in order to infer evolutionary relationships among living organisms. SciPhylomics can benefit from parallel processing techniques provided by existing approaches such as the SciCumulus cloud workflow engine and MapReduce implementations such as Hadoop. Despite some performance fluctuations, computing clouds provide a new dimension for HPC due to their elasticity and availability. In this paper, we present a performance evaluation of SciPhylomics executions in a real cloud environment. The workflow was executed using two parallel execution approaches (SciCumulus and Hadoop) on the Amazon EC2 cloud. Our results reinforce the benefits of data parallelism for the phylogenomic inference workflow using MapReduce-like parallel approaches in the cloud. The performance results demonstrate that this class of bioinformatics experiment is suitable for execution in the cloud despite its need for high-performance capabilities. The evaluated workflow shares many features with other data-intensive workflows, which offers a first indication that these cloud execution results can be extrapolated to other classes of experiments.
