ProvCaRe Semantic Provenance Knowledgebase: Evaluating Scientific Reproducibility of Research Studies

Scientific reproducibility is critical for biomedical research: it enables us to advance science by building on previous results, helps ensure the success of increasingly expensive drug trials, and allows funding agencies to make informed decisions. However, there is a growing "crisis" of reproducibility, as evidenced by a recent survey by the journal Nature of more than 1500 researchers, which found that 70% of researchers were unable to replicate results from other research groups and more than 50% were unable to reproduce their own research results. In 2016, the National Institutes of Health (NIH) announced the "Rigor and Reproducibility" guidelines to support reproducibility in biomedical research. A key component of these guidelines is the recording and analysis of "provenance" information, which describes the origin or history of data and plays a central role in ensuring scientific reproducibility. As part of the NIH Big Data to Knowledge (BD2K)-funded data provenance project, we have developed a new informatics framework called Provenance for Clinical and Healthcare Research (ProvCaRe) to extract, model, and analyze provenance information from published literature describing research studies. Using sleep medicine research studies that have made their data available through the National Sleep Research Resource (NSRR), we have developed an automated pipeline that identifies and extracts provenance metadata from published literature and makes it available for analysis in the ProvCaRe knowledgebase. NSRR is the largest repository of sleep data, drawn from over 40,000 studies involving 36,000 participants; we used 75 published articles describing 6 research studies to populate the ProvCaRe knowledgebase. We evaluated the ProvCaRe knowledgebase, which holds 28,474 "provenance triples," using hypothesis-driven queries to identify and rank research studies based on the provenance information extracted from published articles.
