Using classification models for the generation of disease-specific medications from biomedical literature and clinical data repository

OBJECTIVE Mining disease-specific associations from existing knowledge resources can be useful for building disease-specific ontologies and supporting knowledge-based applications. Many association mining techniques have been applied, but the extracted associations often contain considerable noise. Determining the relevance of an association by setting arbitrary cut-off points on multiple relevance scores is unreliable, and having human experts manually review a large number of associations is expensive. We propose that machine-learning-based classification can separate the signal from the noise and provide a feasible approach to creating and maintaining disease-specific vocabularies. METHOD We initially focused on disease-medication associations for simplicity. For a disease of interest, we extracted potentially treatment-related drug concepts from biomedical literature citations and from a local clinical data repository. Each concept was associated with multiple measures of relevance (i.e., features), such as frequency of occurrence. For machine learning, we formed nine datasets covering three diseases: for each disease, two single-source datasets and one dataset combining the two. All datasets were labeled using existing reference standards. We then conducted two experiments: (1) to test whether adding features from the clinical data repository improves the classification performance achieved with biomedical-literature features alone, and (2) to determine whether classifiers trained on known disease-medication datasets generalize to new diseases. RESULTS Simple logistic regression and LogitBoost were identified as the preferred classifiers for the biomedical-literature datasets and the combined datasets, respectively. Classification using the combined features significantly outperformed classification using the biomedical-literature features alone (p < 0.001). Classifiers trained on known diseases predicted associated concepts for a new disease with no significant difference in performance from classifiers trained and tested on that disease's own dataset. CONCLUSION It is feasible to use classification to automatically predict the relevance of a concept to a disease of interest. Combining features from disparate sources is useful for this classification task, and classifiers built from known diseases generalize to new diseases.
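To make the first experiment concrete, the sketch below illustrates the general setup of comparing literature-only features with combined literature-plus-clinical-repository features for predicting whether a candidate drug concept is relevant to a disease. It is not the authors' pipeline: the synthetic data, the feature names (lit_features, ehr_features), and the scikit-learn model choices (LogisticRegression as a stand-in for simple logistic regression, GradientBoostingClassifier as a stand-in for LogitBoost) are all illustrative assumptions.

```python
# Minimal sketch of Experiment 1 under assumed data and models; not the
# authors' implementation. Synthetic features stand in for relevance scores
# such as frequency of occurrence in MEDLINE citations or in a clinical
# data repository.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LogitBoost
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500  # hypothetical number of candidate drug concepts for one disease

# Hypothetical relevance features from each source.
lit_features = rng.random((n, 3))   # e.g., literature co-occurrence statistics
ehr_features = rng.random((n, 3))   # e.g., clinical-repository co-occurrence statistics
# Hypothetical reference-standard labels (1 = concept relevant to the disease).
labels = (lit_features[:, 0] + ehr_features[:, 0]
          + rng.normal(0, 0.3, n) > 1.0).astype(int)

combined = np.hstack([lit_features, ehr_features])

lit_model = LogisticRegression(max_iter=1000)   # simple-logistic analogue
combo_model = GradientBoostingClassifier()      # boosting analogue of LogitBoost

# Cross-validated probability estimates, scored by area under the ROC curve.
lit_scores = cross_val_predict(lit_model, lit_features, labels,
                               cv=10, method="predict_proba")[:, 1]
combo_scores = cross_val_predict(combo_model, combined, labels,
                                 cv=10, method="predict_proba")[:, 1]

print("AUC, literature features only:", roc_auc_score(labels, lit_scores))
print("AUC, combined features:       ", roc_auc_score(labels, combo_scores))
```

The second experiment would follow the same pattern but train a classifier on the labeled datasets of known diseases and evaluate it on a held-out disease's dataset, comparing its AUC against a classifier trained and tested within that disease alone.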
