Standardized Metadata for Human Pathogen/Vector Genomic Sequences

High-throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data enable genotype-phenotype association studies that identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies that track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, metadata about the characteristics of each pathogen/vector isolate must be collected and made available in organized, clear, and consistent formats. Here we report the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), and informed by interactions with numerous collaborating scientists. The standard includes mappings to terms from other data standards initiatives, including the Genomic Standards Consortium’s minimum information about any (x) sequence (MIxS) checklists, NCBI’s BioSample/BioProject checklists, and the Ontology for Biomedical Investigations (OBI). Its data fields cover characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the isolated pathogen/vector, and project leadership and support. Because the metadata fields are modeled in an ontology-based semantic framework that reuses existing ontologies and minimum information checklists, the application standard can be extended with additional project-specific data fields and integrated with other data represented using comparable standards.
The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.
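As a rough illustration of the four field groups described above, the following Python sketch models a single sample record and checks it for empty core fields. Every field name, the choice of which fields count as core, and the example values are assumptions made for illustration only; they are not the field names or requirements defined by the GSCID/BRC standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class SampleMetadata:
    """Hypothetical sample record; field names are illustrative only."""

    # Characteristics of the organism or environmental source of the specimen
    specimen_source: str
    # Spatial-temporal information about the specimen isolation event
    collection_date: Optional[date]
    collection_country: str
    # Phenotypic characteristics of the isolated pathogen/vector
    phenotypes: Dict[str, str] = field(default_factory=dict)
    # Project leadership and support
    project_lead: str = ""
    funding_source: str = ""
    # Project-specific extension fields beyond the core set
    extensions: Dict[str, str] = field(default_factory=dict)


# Assumed set of core fields that must be non-empty in this sketch
CORE_FIELDS = (
    "specimen_source",
    "collection_date",
    "collection_country",
    "project_lead",
)


def missing_core_fields(record: SampleMetadata) -> List[str]:
    """Return the names of core fields that are empty or None."""
    return [name for name in CORE_FIELDS if not getattr(record, name)]


# Made-up example record with the collection country left blank
example = SampleMetadata(
    specimen_source="Homo sapiens",
    collection_date=date(2012, 6, 1),
    collection_country="",
    phenotypes={"drug_resistance": "artemisinin"},
    project_lead="J. Doe",
    funding_source="NIAID",
    extensions={"study_site": "clinic A"},
)

print(missing_core_fields(example))  # prints ['collection_country']
```

Keeping project-specific fields in a separate `extensions` mapping is one simple way to mirror the idea of a fixed core that individual projects can extend; an ontology-backed implementation would instead tie each field to a term identifier.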

Brett E. Pickett | Richard H. Scheuermann | Lynn M. Schriml | Rick L. Stevens | William C. Nierman | Jessica C. Kissinger | David S. Roos | Christian J. Stoeckert | Scott J. Emrich | Frank H. Collins | Timothy B. Stockwell | Ilene Karsch-Mizrachi | Tanya Barrett | Rebecca Will | Ruchi M. Newman | Matthew R. Henn | Luke Tallon | Gloria I. Giraldo-Calderón | Christina A. Cuomo | R. Burke Squires | Daniel E. Neafsey | Owen White | Mark Eppinger | Indresh Singh | Lisa Sadzewicz | Karen E. Nelson | Herve Tettelin | David Wentworth | Emmanuel F. Mongodin | Bruno Sobral | Scott Durkin | Jie Zheng | Joana C. Silva | Bruce Birren | Claire Fraser | Valentina Di Francesco | David Rasko | Garry Myers | W. Florian Fricke | Jennifer Wortman | Lauren Brinkac | Doyle V. Ward | Alison Yao | Michael Feldgarden | Maria Y. Giovanni | Yun Zhang | Eun Mi Lee | Julie Dunning Hotopp | Daniel E. Sullivan | Erin Hine | Sinéad Chapman | Omar S. Harb | Vincent M. Bruno | Vivien G. Dugan | Cheryl I. Murphy | Julia Puzak | Elizabet Caler | Punam Mathur
