An Extensible Ontology Modeling Approach Using Post Coordinated Expressions for Semantic Provenance in Biomedical Research

Provenance metadata, which describes the source or origin of data, is critical for verifying and validating the results of scientific experiments. The reproducibility of scientific studies is receiving growing attention in the research community, particularly in biomedical and healthcare research. To address this challenge in the biomedical research domain, we have developed Provenance for Clinical and Healthcare Research (ProvCaRe) using the World Wide Web Consortium (W3C) PROV specifications, including the PROV Ontology (PROV-O). In the ProvCaRe project, we are extending PROV-O to create a formal model of the provenance information that is necessary for scientific reproducibility and replication in biomedical research. However, developing the ProvCaRe ontology raises several challenges: (1) Ontology engineering: the set of biomedical provenance-related terms has no well-defined scope, so modeling all of them before the ontology is released is not feasible; (2) Redundancy: a large number of existing biomedical ontologies already model the relevant biomedical terms; and (3) Ontology maintenance: adding or deleting terms in a large ontology is error-prone, which makes the ontology difficult to maintain over time. Therefore, instead of modeling all classes and properties in the ontology before deployment (precoordination), we propose the “ProvCaRe Compositional Grammar Syntax” to model ontology classes on demand (postcoordination). The compositional grammar syntax allows us to reuse existing biomedical ontology classes and to compose provenance-specific terms that extend PROV-O classes and properties. We demonstrate the application of this approach in the ProvCaRe ontology and the use of the ontology in developing the ProvCaRe knowledgebase, which consists of more than 38 million provenance triples automatically extracted from 384,802 published research articles using a text processing workflow.
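To make the idea of postcoordination concrete, the following is a minimal Python sketch of how a provenance term might be composed on demand from a PROV-O class and classes reused from other ontologies. The abstract does not specify the ProvCaRe Compositional Grammar Syntax, so the `focus : attribute = value` notation used here is an assumption, loosely modeled on the general style of compositional grammars. The names `Expression`, `Refinement`, `parse`, the property `ex:hasRole`, and the identifier `EXONT:0000123` are all hypothetical placeholders, not terms from the ProvCaRe ontology.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Refinement:
    """One attribute-value pair that narrows the focus class (hypothetical)."""
    attribute: str  # property identifier, e.g. a placeholder such as "ex:hasRole"
    value: str      # class identifier reused from an existing ontology


@dataclass(frozen=True)
class Expression:
    """A post-coordinated term: a PROV-O class plus zero or more refinements."""
    focus: str                                  # e.g. "prov:Agent"
    refinements: tuple[Refinement, ...] = ()

    def render(self) -> str:
        """Serialize in the assumed 'focus : attr = value, ...' notation."""
        if not self.refinements:
            return self.focus
        parts = ", ".join(f"{r.attribute} = {r.value}" for r in self.refinements)
        return f"{self.focus} : {parts}"


def parse(text: str) -> Expression:
    """Parse the assumed notation back into an Expression.

    Separators are ' : ', ', ' and ' = ' (with spaces), so the colon inside
    a compact identifier such as 'prov:Agent' is not mistaken for a separator.
    """
    focus, sep, rest = text.partition(" : ")
    refinements = []
    if sep:
        for chunk in rest.split(", "):
            attribute, eq, value = chunk.partition(" = ")
            if not eq:
                raise ValueError(f"malformed refinement: {chunk!r}")
            refinements.append(Refinement(attribute.strip(), value.strip()))
    return Expression(focus.strip(), tuple(refinements))


# Compose a term on demand instead of declaring it in the ontology up front.
expr = Expression("prov:Agent", (Refinement("ex:hasRole", "EXONT:0000123"),))
print(expr.render())  # prov:Agent : ex:hasRole = EXONT:0000123
```

The point of this design is that no new class needs to be added to the ontology for each such term: the expression only references identifiers that already exist elsewhere, which is the maintenance advantage the abstract attributes to postcoordination.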
