Enabling large‐scale next‐generation sequence assembly with Blacklight

A variety of extremely challenging biological sequence analyses were conducted on Blacklight, the XSEDE large shared-memory resource, using current bioinformatics tools and spanning a wide range of scientific applications. These include genomic sequence assembly, very large metagenomic sequence assembly, transcriptome assembly, and sequencing error correction. The data sets used in these analyses included uncategorized fungal species, reference microbial data, very large soil and human gut microbiome sequence data, and primate transcriptomes, and comprised both short-read and long-read sequence data. A new parallel command execution program was developed on Blacklight to handle some of these analyses. These results, first reported at XSEDE13 and expanded here, represent significant advances for their respective scientific communities. The breadth and depth of the results demonstrate the ease of use, versatility, and unique capabilities of Blacklight for scientific analysis of genomic and transcriptomic sequence data, and the power of such resources, together with XSEDE support, to address the most challenging scientific problems. Copyright © 2014 John Wiley & Sons, Ltd.
