Phenotype Instance Verification and Evaluation Tool (PIVET): A Scaled Phenotype Evidence Generation Framework Using Web-Based Medical Literature

Background: Researchers are developing methods to automatically extract clinically relevant and useful patient characteristics from raw healthcare datasets. These characteristics, often capturing essential properties of patients with common medical conditions, are called computational phenotypes. Because they are generated by automated or semiautomated, data-driven methods, such candidate phenotypes need to be validated as clinically meaningful (or not) before they are acceptable for use in decision making.

Objective: The objective of this study was to present Phenotype Instance Verification and Evaluation Tool (PIVET), a framework that uses co-occurrence analysis on an online corpus of publicly available medical journal articles to build clinical relevance evidence sets for user-supplied phenotypes. PIVET adopts a conceptual framework similar to that of the pioneering prototype tool PheKnow-Cloud, which was developed for the phenotype validation task. PIVET completely refactors each part of the PheKnow-Cloud pipeline to deliver vast improvements in speed without sacrificing the quality of the insights PheKnow-Cloud achieved.

Methods: PIVET leverages indexing in NoSQL databases to efficiently generate evidence sets. Specifically, PIVET uses a succinct representation of the phenotypes that corresponds to the index on the corpus database and an optimized co-occurrence algorithm inspired by the Aho-Corasick algorithm. We compare PIVET’s phenotype representation with PheKnow-Cloud’s by using PheKnow-Cloud’s experimental setup. In PIVET’s framework, we also introduce a statistical model trained on domain expert–verified phenotypes to automatically classify phenotypes as clinically relevant or not. Additionally, we show how the classification model can be used to examine user-supplied phenotypes in an online, rather than batch, manner.

Results: PIVET maintains the discriminative power of PheKnow-Cloud in terms of identifying clinically relevant phenotypes for the same corpus with which PheKnow-Cloud was originally developed, but PIVET’s analysis is an order of magnitude faster than that of PheKnow-Cloud. Not only is PIVET much faster, but it can also be scaled to a larger corpus and still retain its speed. We evaluated multiple classification models on top of the PIVET framework and found ridge regression to perform best, realizing an average F1 score of 0.91 when predicting clinically relevant phenotypes.

Conclusions: Our study shows that PIVET improves on the most notable existing computational tool for phenotype validation in terms of speed and automation and is comparable in terms of accuracy.
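To make the co-occurrence idea from the Methods concrete, the following is a minimal sketch, not PIVET's actual implementation. It builds a textbook Aho-Corasick automaton over a set of phenotype terms, scans each document once to find which terms occur, and scores term pairs with lift. The function names, the toy documents, and the choice of lift as the association measure are all assumptions made for illustration; matching is plain substring matching with no word-boundary handling, and no database index is involved.

```python
from collections import deque
from itertools import combinations


def build_automaton(terms):
    """Build an Aho-Corasick automaton: trie transitions, failure links, outputs."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link for each state
    out = [set()]  # terms recognised on reaching each state
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(term)
    # Breadth-first pass: depth-1 states fail to the root, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending at the suffix state
    return goto, fail, out


def find_terms(text, automaton):
    """Return the set of terms that occur in text, using a single left-to-right scan."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


def cooccurrence_lift(docs, terms):
    """Score each co-occurring term pair with lift = P(a, b) / (P(a) * P(b)).

    Lift is an illustrative choice of association measure (an assumption here).
    Terms are expected in lowercase; documents are lowercased before matching.
    """
    automaton = build_automaton(terms)
    single = {t: 0 for t in terms}
    pair = {}
    for doc in docs:
        found = find_terms(doc.lower(), automaton)
        for t in found:
            single[t] += 1
        for a, b in combinations(sorted(found), 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    n = len(docs)
    return {
        (a, b): (c / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), c in pair.items()
    }


# Hypothetical toy corpus and phenotype terms, for illustration only.
docs = [
    "Patients with hypertension often develop chronic kidney disease.",
    "Hypertension and diabetes are common comorbidities.",
    "Diabetes management in primary care.",
    "Chronic kidney disease progression in hypertension.",
]
terms = ["hypertension", "diabetes", "chronic kidney disease"]
lift = cooccurrence_lift(docs, terms)
```

A lift above 1 suggests that two terms appear together more often than independence would predict. In this sketch each document is scanned once regardless of how many terms are searched for, which is the property that makes Aho-Corasick-style matching attractive when many phenotype terms must be checked against a large corpus.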
