Interoperable genome annotation with GBOL, an extendable infrastructure for functional data mining

Background: A standard structured format is used by the public sequence databases to present genome annotations. A prerequisite for a direct functional comparison is consistent annotation of the genetic elements with evidence statements. However, the current format provides limited support for data mining, hampering comparative analyses at large scale.

Results: The provenance of a genome annotation describes the contextual details and derivation history of the process that resulted in the annotation. To enable interoperability of genome annotations, we have developed the Genome Biology Ontology Language (GBOL) and associated infrastructure (GBOL stack). GBOL is provenance aware and thus provides a consistent representation of functional genome annotations linked to their provenance. GBOL is modular in design, extensible and linked to existing ontologies. The GBOL stack of supporting tools enforces consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. Modules have been developed to serialize the linked data (RDF) and to generate plain-text format files.

Conclusion: The main rationale for applying formalized information models is to improve the exchange of information. GBOL uses and extends current ontologies to provide a formal representation of genomic entities, along with their properties and relations. The deliberate integration of data provenance in the ontology enables review of automatically obtained genome annotations at a large scale. The GBOL stack facilitates consistent usage of the ontology.
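As an illustration of what a provenance-linked annotation could look like as linked data, the following minimal Python sketch stores one annotation as RDF-style triples and follows the provenance links back to the tool that produced it. It is not taken from the GBOL stack: the `http://example.org/gbol-sketch#` namespace, the `Gene` and `product` terms, and the helper names `annotate`, `to_ntriples` and `tools_for` are hypothetical. Only the two PROV-O properties (`prov:wasGeneratedBy`, `prov:wasAssociatedWith`) and `rdf:type` are real vocabulary terms.

```python
# Hypothetical namespace for illustration; real GBOL IRIs and terms may differ.
EX = "http://example.org/gbol-sketch#"
# Real W3C vocabularies.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(namespace, local_name):
    """Format a full IRI in N-Triples angle-bracket syntax."""
    return f"<{namespace}{local_name}>"


def annotate(feature, product, activity, tool):
    """Return triples for one annotation plus the activity and tool behind it."""
    f, a, t = iri(EX, feature), iri(EX, activity), iri(EX, tool)
    return [
        (f, f"<{RDF_TYPE}>", iri(EX, "Gene")),
        # Plain literal; assumes the product string needs no escaping.
        (f, iri(EX, "product"), f'"{product}"'),
        (f, iri(PROV, "wasGeneratedBy"), a),
        (a, iri(PROV, "wasAssociatedWith"), t),
    ]


def to_ntriples(triples):
    """Serialize triples as one 'subject predicate object .' statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def tools_for(triples, feature):
    """Follow feature -> generating activity -> associated tool."""
    f = iri(EX, feature)
    generated_by = iri(PROV, "wasGeneratedBy")
    associated_with = iri(PROV, "wasAssociatedWith")
    activities = {o for s, p, o in triples if s == f and p == generated_by}
    return sorted(
        o for s, p, o in triples if s in activities and p == associated_with
    )


triples = annotate("gene1", "DNA polymerase", "run1", "annotationTool")
print(to_ntriples(triples))
print(tools_for(triples, "gene1"))
```

Because the provenance sits in the same graph as the annotation, a reviewer can ask which tool produced a given statement with the same kind of query used for the annotation itself.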
