Paper information - A scalable data analysis platform for metagenomics

With the advent of high-throughput DNA sequencing technology, the analysis and management of the increasing amount of biological sequence data have become a bottleneck for scientific progress. For example, MG-RAST, a metagenome annotation system serving a large scientific community worldwide, has experienced sustained, exponential growth in data submissions for several years, and this trend is expected to continue. To address the computational challenges posed by this workload, we developed a new data analysis platform, including a data management system (Shock) for biological sequence data and a workflow management system (AWE) supporting scalable, fault-tolerant task and resource management. Shock and AWE can be used to build a scalable and reproducible data analysis infrastructure for upper-level biological data analysis services.
