The Empusa code generator and its application to GBOL, an extendable ontology for genome annotation

The RDF data model facilitates integration of diverse data available in structured and semi-structured formats. To obtain a coherent RDF graph, the chosen ontology must be applied consistently. However, the addition of new and diverse data causes the ontology to evolve, which can lead to the accumulation of unintended, erroneous composites. Thus, there is a need for a gatekeeping system that compares the intended content described in the ontology with the actual content of the resource. The Empusa code generator facilitates the creation of composite RDF resources from disparate sources. Empusa converts a schema into an associated application programming interface (API) that can be used to perform data consistency checks, and it generates Markdown documentation to make persistent URLs resolvable. With Empusa, consistency is ensured both within the ontology and between the ontology and the content of the resource. As an illustration of the potential of Empusa, we present the Genome Biology Ontology Language (GBOL). GBOL uses and extends current ontologies to provide a formal representation of genomic entities, along with their properties, relations and provenance.
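
The gatekeeping idea described above, comparing what a schema intends with what a resource actually contains, can be illustrated with a minimal Python sketch. This is not Empusa's generated API: the `SCHEMA` layout, the `check_consistency` function, and the `ex:` names are hypothetical and chosen only for illustration, and triples are modelled as plain tuples rather than with an RDF library.

```python
RDF_TYPE = "rdf:type"

# Hypothetical schema (an assumption, not GBOL): class -> {property: is_required}
SCHEMA = {
    "ex:Gene": {"ex:locusTag": True, "ex:product": False},
}


def check_consistency(triples, schema):
    """Return a list of human-readable violations of `schema` found in `triples`."""
    types = {}   # subject -> declared class
    props = {}   # subject -> set of properties used on that subject
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s] = o
        else:
            props.setdefault(s, set()).add(p)

    violations = []
    for subject in sorted(types):
        cls = types[subject]
        if cls not in schema:
            violations.append(f"{subject}: unknown class {cls}")
            continue
        allowed = schema[cls]
        used = props.get(subject, set())
        # Properties present in the data but absent from the schema.
        for p in sorted(used - set(allowed)):
            violations.append(f"{subject}: property {p} not defined for {cls}")
        # Properties the schema requires but the data lacks.
        for p in sorted(allowed):
            if allowed[p] and p not in used:
                violations.append(f"{subject}: missing required property {p}")
    return violations


# Example data: ex:g1 conforms to the schema, ex:g2 does not.
TRIPLES = [
    ("ex:g1", RDF_TYPE, "ex:Gene"),
    ("ex:g1", "ex:locusTag", "ABC_0001"),
    ("ex:g2", RDF_TYPE, "ex:Gene"),
    ("ex:g2", "ex:colour", "red"),
]

for line in check_consistency(TRIPLES, SCHEMA):
    print(line)
```

In this sketch the check runs after the data exist; a generated API of the kind the abstract describes would presumably enforce the same constraints while the resource is being written, so that inconsistent triples are rejected before they enter the graph.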
