A scalable data analysis platform for metagenomics

With the advent of high-throughput DNA sequencing technology, the analysis and management of the increasing amount of biological sequence data has become a bottleneck for scientific progress. For example, MG-RAST, a metagenome annotation system serving a large scientific community worldwide, has experienced a sustained, exponential growth in data submissions for several years; and this trend is expected to continue. To address the computational challenges posed by this workload, we developed a new data analysis platform, including a data management system (Shock) for biological sequence data and a workflow management system (AWE) supporting scalable, fault-tolerant task and resource management. Shock and AWE can be used to build a scalable and reproducible data analysis infrastructure for upper-level biological data analysis services.

[1]  Edward A. Lee,et al.  Scientific workflow management and the Kepler system , 2006, Concurr. Comput. Pract. Exp..

[2]  Zhiling Lan,et al.  Multi-domain job coscheduling for leadership computing systems , 2012, The Journal of Supercomputing.

[3]  Nathan Blow,et al.  Metagenomics: Exploring unseen communities , 2008, Nature.

[4]  Wu-chun Feng,et al.  Accelerating Data-Intensive Genome Analysis in the Cloud , 2013 .

[5]  A. Nekrutenko,et al.  Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences , 2010, Genome Biology.

[6]  Douglas Thain,et al.  Distributed computing in practice: the Condor experience , 2005, Concurr. Pract. Exp..

[7]  Daniel S. Katz,et al.  Swift: A language for distributed parallel scripting , 2011, Parallel Comput..

[8]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[9]  Andreas Wilke,et al.  Using clouds for metagenomics: A case study , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[10]  Jing Chen,et al.  Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource , 2010, Nucleic Acids Res..

[11]  Roy T. Fielding,et al.  Principled design of the modern Web architecture , 2000, Proceedings of the 2000 International Conference on Software Engineering. ICSE 2000 the New Millennium.

[12]  Andreas Wilke,et al.  An experience report: porting the MG‐RAST rapid metagenomics analysis pipeline to the cloud , 2011, Concurr. Comput. Pract. Exp..

[13]  Dmitry Pushkarev,et al.  Single-molecule sequencing of an individual human genome , 2009, Nature Biotechnology.

[14]  Geoffrey C. Fox,et al.  Hybrid cloud and cluster computing paradigms for life science applications , 2010, BMC Bioinformatics.

[15]  Jano I. van Hemert,et al.  Scientific Workflow: A Survey and Research Directions , 2007, PPAM.

[16]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[17]  Carole A. Goble,et al.  Taverna, Reloaded , 2010, SSDBM.

[18]  Peter M. Rice,et al.  The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants , 2009, Nucleic acids research.

[19]  Andreas Wilke,et al.  phylogenetic and functional analysis of metagenomes , 2022 .

[20]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[21]  Daniel S. Katz,et al.  Pegasus: A framework for mapping complex scientific workflows onto distributed systems , 2005, Sci. Program..

[22]  Elizabeth M Glass,et al.  From genomics to metagenomics. , 2012, Current opinion in biotechnology.

[23]  Lavanya Ramakrishnan,et al.  Grid portals for bioinformatics , 2006 .