Cyberinfrastructure resources enabling creation of the loblolly pine reference transcriptome

Today's genomics technologies generate more sequence data than ever before possible, and at substantially lower costs, serving researchers across biological disciplines in transformative ways. Building transcriptome assemblies from RNA sequencing reads is one application of next-generation sequencing (NGS) that has held a central role in biological discovery in both model and non-model organisms, with and without whole genome sequence references. A major limitation in effective building of transcriptome references is no longer the sequencing data generation itself, but the computing infrastructure and expertise needed to assemble, analyze and manage the data. Here we describe a currently available resource dedicated to achieving such goals, and its use for extensive RNA assembly of up to 1.3 billion reads representing the massive transcriptome of loblolly pine, using four major assembly software installations. The Mason cluster, an XSEDE second tier resource at Indiana University, provides the necessary fast CPU cycles, large memory, and high I/O throughput for conducting large-scale genomics research. The National Center for Genome Analysis Support, or NCGAS, provides technical support in using HPC systems, bioinformatic support for determining the appropriate method to analyze a given dataset, and practical assistance in running computations. We demonstrate that a sufficient supercomputing resource and good workflow design are elements that are essential to large eukaryotic genomics and transcriptomics projects such as the complex transcriptome of loblolly pine, gene expression data that inform annotation and functional interpretation of the largest genome sequence reference to date.

[1]  Craig A. Stewart,et al.  Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond , 2012 .

[2]  Victor Hazlewood,et al.  Methods For Creating XSEDE Compatible Clusters , 2014, XSEDE '14.

[3]  Martin Vingron,et al.  Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels , 2012, Bioinform..

[4]  Geoffrey C. Fox,et al.  What is cyberinfrastructure , 2010, SIGUCCS '10.

[5]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.

[6]  N. Friedman,et al.  Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data , 2011, Nature Biotechnology.

[7]  Hans A. Vasquez-Gross,et al.  Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation , 2014, Genetics.

[8]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[9]  Douglas G. Scofield,et al.  The Norway spruce genome sequence and conifer genome evolution , 2013, Nature.

[10]  Björn Usadel,et al.  Trimmomatic: a flexible trimmer for Illumina sequence data , 2014, Bioinform..

[11]  Tim H. Brom,et al.  A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data , 2012, 1203.4802.

[12]  Matthew W. Vaughn,et al.  National Center for Genome Analysis support leverages XSEDE to support life science research , 2013, XSEDE.

[13]  S. Salzberg,et al.  Sequencing and Assembly of the 22-Gb Loblolly Pine Genome , 2014, Genetics.

[14]  Le-Shin Wu,et al.  Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies , 2014, Genome Biology.

[15]  Steven J. M. Jones,et al.  De novo assembly and analysis of RNA-seq data , 2010, Nature Methods.

[16]  Scott Lathrop,et al.  Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment , 2014 .

[17]  Le-Shin Wu,et al.  Trinity RNA-Seq assembler performance optimization , 2012, XSEDE '12.

[18]  Nancy Wilkins-Diehr Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery , 2013 .

[19]  Inanç Birol,et al.  Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data , 2013, Bioinform..