Shifting the bioinformatics computing paradigm: A case study in parallelizing genome annotation using MAKER and Work Queue

Next-generation sequencing technologies have enabled entities ranging from large sequencing centers to individual laboratories to sequence organisms of their choice and analyze them on demand. Sequencing and analysis, however, are only part of the equation: to learn about a given organism, scientists need to annotate it. Each of these problems is highly parallel at a basic level of computation; however, only a few applications support a parallelization framework, typically a single one such as MPI. Because demand for computational analysis is increasing and these problems are inherently parallel, applications should run easily on clusters, clouds, and grids (even simultaneously). This would let labs of various sizes harness the computing power available to them without having to invest in a particular type of batch system. Here we describe modifications made to one particular tool, MAKER, a genome annotation tool that is provided both as a serial application and as an MPI application. Our modifications enable it to run without MPI and to use a wide variety of distributed computing platforms. Further, our proposed parallel framework allows for easy explicit data transfer, which helps overcome a major limitation of bioinformatics tools: they generally rely on a shared filesystem. The distributed computing framework we chose, Work Queue, can be used even during early stages of development to run bioinformatics tools on clusters, grids, and clouds. We evaluate our modifications using the Caenorhabditis japonica genome, comprising 180 megabases of data, and achieve a speedup of 45× using 50 workers.
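To put the reported figures in context, the short sketch below computes parallel efficiency (speedup divided by worker count) from the two numbers stated above. The function and variable names are illustrative and not part of MAKER or Work Queue, and the calculation assumes the speedup is measured relative to a single serial run.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Return speedup per worker; 1.0 would mean perfect linear scaling."""
    if workers <= 0:
        raise ValueError("workers must be a positive integer")
    return speedup / workers


# Figures taken from the abstract: 45x speedup using 50 workers.
reported_speedup = 45.0
reported_workers = 50

efficiency = parallel_efficiency(reported_speedup, reported_workers)
print(f"Parallel efficiency: {efficiency:.0%}")
```

Under that assumption, each worker contributes about nine tenths of the throughput of an ideal, overhead-free worker.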
