Workflow and web application for annotating NCBI BioProject transcriptome data

Abstract: The volume of transcriptome data is growing exponentially owing to rapid improvements in experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this influx of data. New and specialized databases, such as the Transcriptome Shotgun Assembly Sequence Database (TSA) and the Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines for extending the annotation of newly deposited data, so third-party applications are required for that purpose. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. Both the workflow and the web application are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools, which extends its data-processing functionality and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621.

Database URL: http://www.ncbi.nlm.nih.gov/projects/physalis/
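As an illustration of how BLAST alignments can be turned into per-transcript annotations, the following minimal sketch selects the best-scoring hit for each transcript from BLAST tabular output. It is not the workflow's actual code: the function name `best_hits`, the E-value cutoff of 1e-5, and the sample rows are assumptions made for this example, and the column order is the default 12-column layout of BLAST+ tabular output (`-outfmt 6`).

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query id: best hit}, keeping only hits with evalue <= max_evalue.

    The best hit is the one with the highest bit score. The cutoff is an
    assumed value for illustration, not one taken from the paper.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present in -outfmt 7 output).
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Hypothetical alignment rows, invented for this example only.
SAMPLE = (
    "contig_1\tsubject_A\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-50\t350\n"
    "contig_1\tsubject_B\t99.0\t210\t2\t0\t1\t210\t1\t210\t1e-80\t410\n"
    "contig_2\tsubject_C\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30\n"
)

if __name__ == "__main__":
    for query, hit in best_hits(io.StringIO(SAMPLE)).items():
        print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

In this sketch a transcript with no hit below the cutoff is simply absent from the result, which a downstream step could treat as "unannotated".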
