The Empusa code generator and its application to GBOL, an extendable ontology for genome annotation

arXiv:1812.04386v1 [cs.DB] 11 Dec 2018

The RDF data model facilitates integration of diverse data available in structured and semi-structured formats. To obtain an RDF graph with few errors and little internal redundancy, the chosen ontology must be applied consistently. However, the ontology must evolve with each addition of new, diverse data, and its growing complexity can lead to an accumulation of unintended, erroneous composites. Thus, there is a need for a gatekeeping system that compares the intended content described in the ontology with the actual content of the resource. Here we present Empusa, a tool developed to facilitate the creation of composite RDF resources from disparate sources. Empusa converts a schema into an associated application programming interface (API) that can be used to perform data consistency checks, and it generates Markdown documentation to make persistent URLs resolvable. In this way, the use of Empusa ensures consistency within and between the ontology (OWL), the Shape Expressions (ShEx) describing the graph structure, and the content of the resource.

Background & Summary

Semantic Web technologies provide information retrieval and management systems to integrate heterogeneous data from disparate sources [1]. The RDF data model is a W3C standard for storing information as self-descriptive Subject, Predicate and Object triples that can be linked in an RDF graph [2, 3]. The use of retrievable controlled vocabularies enables integration of heterogeneous data from different sources in a single repository, and SPARQL can be used to query the resulting resources [4, 5].

By themselves, RDF graphs have neither a predefined structure nor a schema, and the structure of an RDF resource can vary as new triples are added.
Therefore, a formal definition of the relations among the terms, called an ontology, is required to efficiently retrieve linked information from these resources. Structural information can be encoded using Web Ontology Language (OWL) files [6]. RDFS is another, related standard for defining the structure of an RDF resource [7]. In this standard, each object can be defined as an instance of a class and each link as the realisation of a property. Shape Expressions (ShEx) is a standard to describe, validate and transform RDF data. One of its goals is to provide an easy-to-read language for the validation of instance data [8, 9, 10].

In previous work, we developed RDF2Graph, a tool to automatically recover the structure of an RDF resource and to generate a visualisation, a ShEx file and/or an OWL ontology thereof [11]. Application of RDF2Graph to life science resources that provide data in the RDF data model, such as Reactome, ChEBI, UniProt, or those transformed by the Bio2RDF project [12, 13, 14, 15, 16], showed mismatches between the retrieved data structure and the one described in the OWL definition of the particular resource. The main reason for this lack of consistency is the flexibility provided by RDF: the data graph is free-form, and the ontology defines the structure but does not enforce it. In the development of RDF resources, transformation of existing data into the RDF data model is often a source of errors, such as typing errors in the predicates, instances with missing attributes, instances with a non-unique IRI, and instances with no type defined. Development of tools that directly use the RDF data model as a means to store their output may therefore be essential to unlock the potential of these technologies in the life sciences.
An example of such a tool is the Semantic Annotation Platform with Provenance (SAPP) [17], which can automatically annotate genome sequences using standard tools and directly store the annotation results and their provenance in the RDF data model using the Genome Biology Ontology Language (GBOL) [18]. Development of such tools would be greatly facilitated by supporting tools that can read an ontology definition and generate code for data generation, export and validation.

Here we present Empusa, which has been developed to facilitate the creation of RDF resources that are validated upon creation (figure 1). Empusa takes an ontology defined in OWL and a simplified version of ShEx and generates an associated application programming interface (API) that can be used to perform data consistency checks. The use of Empusa ensures consistency within and between the ontology (OWL), the Shape Expressions (ShEx) describing the graph structure, and the content of the resource. In addition, Markdown documentation is generated, making URLs related to the ontology resolvable [24].
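The gatekeeping idea can be illustrated with a minimal sketch in plain Python. It is not Empusa's generated API or its actual checking logic: the `SCHEMA` layout, the `check_graph` function, the `ex:` names and the example triples are all hypothetical, and the schema is simplified so that every allowed predicate of a class is also required. The sketch only shows how comparing a schema with the content of a graph exposes some of the error classes named in the text (mistyped predicates, missing attributes, untyped instances).

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"

# Hypothetical schema: class name -> predicates every instance must carry.
SCHEMA: Dict[str, Set[str]] = {
    "ex:Gene": {"ex:locusTag", "ex:begin", "ex:end"},
}

# Hypothetical data with two deliberate mistakes.
EXAMPLE: List[Triple] = [
    ("ex:gene1", RDF_TYPE, "ex:Gene"),
    ("ex:gene1", "ex:locusTag", "b0001"),
    ("ex:gene1", "ex:begin", "190"),
    ("ex:gene1", "ex:ennd", "255"),        # typing error in the predicate
    ("ex:gene2", "ex:locusTag", "b0002"),  # instance without a type
]


def check_graph(triples: List[Triple], schema: Dict[str, Set[str]]) -> List[str]:
    """Return a sorted list of consistency problems found in the triples."""
    types: Dict[str, str] = {}
    used: Dict[str, Set[str]] = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject] = obj
            used.setdefault(subject, set())
        else:
            used.setdefault(subject, set()).add(predicate)

    problems: List[str] = []
    for subject, predicates in used.items():
        cls = types.get(subject)
        if cls is None:
            problems.append(f"{subject}: no type defined")
            continue
        if cls not in schema:
            problems.append(f"{subject}: unknown class {cls}")
            continue
        expected = schema[cls]
        # Predicates not declared for the class are treated as typos.
        for predicate in sorted(predicates - expected):
            problems.append(f"{subject}: unknown predicate {predicate}")
        # Declared predicates that never occur are missing attributes.
        for predicate in sorted(expected - predicates):
            problems.append(f"{subject}: missing attribute {predicate}")
    return sorted(problems)


if __name__ == "__main__":
    for problem in check_graph(EXAMPLE, SCHEMA):
        print(problem)
```

A generated API would typically run such checks at the moment an object is created rather than over a finished graph; the batch form is used here only to keep the sketch short.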

[1] Mark A. Musen, et al. The Protégé project: a look back and a look forward, 2015, SIGAI.

[2] Allan Kuchinsky, et al. The Synthetic Biology Open Language (SBOL) provides a community standard for communicating designs in synthetic biology, 2014, Nature Biotechnology.

[3] Christoph Steinbeck, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013, 2012, Nucleic Acids Res.

[4] James A. Hendler, et al. A new form of Web content that is meaningful to computers will unleash a revolution of new possibili , 2002.

[5] Tim Berners-Lee, et al. Linked Data - The Story So Far, 2009, Int. J. Semantic Web Inf. Syst.

[6] Harold R. Solbrig, et al. Validating RDF with Shape Expressions, 2014, ArXiv.

[7] P. Shannon, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks, 2003, Genome Research.

[8] Adam Wojciechowski, et al. Experimental Evaluation of Pair Programming, 2001.

[9] Cathy H. Wu, et al. UniProt: the Universal Protein knowledgebase, 2004, Nucleic Acids Res.

[10] Amit P. Sheth, et al. Semantic Services, Interoperability and Web Applications - Emerging Concepts, 2011, Semantic Services, Interoperability and Web Applications.

[11] Andrew M. Jenkinson, et al. The EBI RDF platform: linked open data for the life sciences, 2014, Bioinform.

[12] E. Prud'hommeaux, et al. SPARQL query language for RDF, 2011.

[13] Henning Hermjakob, et al. The Reactome pathway knowledgebase, 2013, Nucleic Acids Res.

[14] Edoardo Saccenti, et al. SAPP: functional genome annotation and analysis through a semantic framework using FAIR principles, 2017, Bioinform.

[15] Jesse C. J. van Dam, et al. RDF2Graph a tool to recover, understand and validate the ontology of an RDF resource, 2015, J. Biomed. Semant.

[16] Jon Olav Vik, et al. Interoperable genome annotation with GBOL, an extendable infrastructure for functional data mining, 2017, bioRxiv.

[17] Nicole Tourigny, et al. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems, 2008, J. Biomed. Informatics.

[18] Harold R. Solbrig, et al. Shape expressions: an RDF validation and transformation language, 2014, SEM '14.