MetaSRA: normalized sample-specific metadata for the Sequence Read Archive

Motivation: The NCBI’s Sequence Read Archive (SRA) promises great biological insight if its data could be analyzed in aggregate; however, the data remain largely underutilized, in part because of the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms for describing the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants, and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across the diverse diseases, tissues, and cell types present in the SRA.

Results: We present MetaSRA, a database of normalized SRA sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline.

Availability: The MetaSRA database is available at http://deweylab.biostat.wisc.edu/metasra. Software implementing our computational pipeline is available at https://github.com/deweylab/metasra-pipeline.

Contact: cdewey@biostat.wisc.edu
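To make the three-part schema described in the Results concrete (ontology terms, a sample-type category, and real-valued properties), the following is a minimal Python sketch of what one normalized record could look like. It is not the MetaSRA pipeline: the synonym table, the `TOY:` term IDs, the `NormalizedSample` class, the `normalize` function, and the sample-type rule are all hypothetical stand-ins chosen for illustration, and exact dictionary lookup replaces whatever matching the real pipeline performs.

```python
import re
from dataclasses import dataclass, field

# Hypothetical synonym table: lowercase surface form -> placeholder term ID.
# A real system would draw these from biomedical ontologies.
TOY_SYNONYMS = {
    "liver": "TOY:0001",
    "hepatic tissue": "TOY:0001",
    "hepatocellular carcinoma": "TOY:0002",
    "hcc": "TOY:0002",
}

# A number optionally followed by a unit word, e.g. "45 years".
NUMBER_WITH_UNIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)?\s*$")


@dataclass
class NormalizedSample:
    ontology_terms: set = field(default_factory=set)
    sample_type: str = "unknown"
    real_valued: dict = field(default_factory=dict)


def normalize(raw):
    """Turn raw key-value metadata into the three-part record."""
    out = NormalizedSample()
    for key, value in raw.items():
        text = value.strip().lower()
        # 1. Map free-text values to ontology terms (exact lookup only).
        if text in TOY_SYNONYMS:
            out.ontology_terms.add(TOY_SYNONYMS[text])
            continue
        # 2. Extract real-valued properties as (value, unit) pairs.
        match = NUMBER_WITH_UNIT.match(value)
        if match:
            out.real_valued[key.strip().lower()] = (
                float(match.group(1)),
                match.group(2),
            )
    # 3. Assign a sample-type category. This rule is a placeholder
    #    assumption, not the method used by MetaSRA.
    if "cell line" in {k.strip().lower() for k in raw}:
        out.sample_type = "cell line"
    elif out.ontology_terms:
        out.sample_type = "tissue"
    return out


example = normalize(
    {"tissue": "Hepatic tissue", "disease": "HCC", "age": "45 years"}
)
```

In this toy input, the synonyms "Hepatic tissue" and "HCC" resolve to shared term IDs, while "45 years" is kept as a numeric value with its unit, which illustrates why normalization makes samples comparable across submissions.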
