Utilization of Electronic Medical Records and Biomedical Literature to Support the Diagnosis of Rare Diseases Using Data Fusion and Collaborative Filtering Approaches

Background: In the United States, a rare disease is defined as one that affects no more than 200,000 patients at any given time. Patients with rare diseases are often misdiagnosed or left undiagnosed, possibly because clinical practitioners have insufficient knowledge of or experience with these diseases. As the volume of electronically accessible medical data grows exponentially, a large amount of information on thousands of rare diseases and their potentially associated diagnostic information is buried in electronic medical records (EMRs) and medical literature.

Objective: This study aimed to leverage information contained in heterogeneous datasets to assist rare disease diagnosis. Phenotypic information about patients that exists in EMRs and biomedical literature could be fully leveraged to speed up diagnosis.

Methods: In our previous work, we advanced the use of a collaborative filtering recommendation system to support rare disease diagnostic decision making based on phenotypes derived solely from EMR data. However, that work did not discuss the influence of using heterogeneous data with collaborative filtering, which is an essential problem when facing large volumes of data from various resources. In this study, to further investigate the performance of collaborative filtering on heterogeneous datasets, we studied EMR data generated at Mayo Clinic as well as published article abstracts retrieved from the Semantic MEDLINE Database. Specifically, we designed different strategies for fusing data from heterogeneous resources and integrated them with the collaborative filtering model.

Results: We evaluated the performance of the proposed system using characterizations derived from various combinations of EMR data and literature, as well as from EMR data alone. We extracted nearly 13 million EMRs from the patient cohort generated between 2010 and 2015 at Mayo Clinic and retrieved all article abstracts from the semistructured Semantic MEDLINE Database that were published through the end of 2016. We applied a collaborative filtering model and compared the performance obtained with different metrics. Log likelihood ratio similarity combined with k-nearest neighbor on heterogeneous datasets showed the best performance in patient recommendation, with an area under the precision-recall curve (PRAUC) of 0.475 (string match), 0.511 (Systematized Nomenclature of Medicine [SNOMED] match), and 0.752 (Genetic and Rare Diseases Information Center [GARD] match). Log likelihood ratio similarity also performed best in terms of mean average precision, with 0.465 (string match), 0.5 (SNOMED match), and 0.749 (GARD match). Performance of rare disease prediction was also demonstrated using the optimal algorithm. The macro-average F-measures for string, SNOMED, and GARD match were 0.32, 0.42, and 0.63, respectively.

Conclusions: This study demonstrated the potential of heterogeneous datasets in a collaborative filtering model to support rare disease diagnosis. In addition to phenotype-based analysis, we plan in the future to further resolve the heterogeneity issue and reduce the mismatch between EMR data and literature by mining genotypic information to establish a comprehensive disease-phenotype-gene network for rare disease diagnosis.
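To make the best-performing configuration in the Results more concrete, the following is a minimal sketch of log likelihood ratio similarity combined with k-nearest neighbor recommendation. It is not the study's implementation. It assumes that each patient is represented as a set of phenotype terms, that the log likelihood ratio is Dunning's statistic computed on a 2x2 table of shared and unshared phenotypes, and that the score is mapped to [0, 1) as 1 - 1/(1 + LLR). The function names, the toy patients, and the vocabulary size are all hypothetical.

```python
import math


def _xlogx(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Dunning's log likelihood ratio for a 2x2 contingency table."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(a, b, n_phenotypes):
    """Similarity in [0, 1) between two patients given as phenotype sets.

    n_phenotypes is the (assumed) size of the phenotype vocabulary.
    """
    k11 = len(a & b)                        # phenotypes shared by both
    k12 = len(a - b)                        # only in patient a
    k21 = len(b - a)                        # only in patient b
    k22 = n_phenotypes - k11 - k12 - k21    # in neither patient
    return 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))


def recommend_diseases(query, patients, diagnoses, n_phenotypes, k=2):
    """Rank diagnoses of the k most similar patients by summed similarity."""
    ranked = sorted(
        ((llr_similarity(query, ph, n_phenotypes), pid)
         for pid, ph in patients.items()),
        reverse=True,
    )
    scores = {}
    for sim, pid in ranked[:k]:
        disease = diagnoses[pid]
        scores[disease] = scores.get(disease, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data, for illustration only.
patients = {
    "p1": {"seizure", "ataxia", "hypotonia"},
    "p2": {"seizure", "ataxia", "nystagmus"},
    "p3": {"rash", "fever", "arthralgia"},
}
diagnoses = {"p1": "disease A", "p2": "disease A", "p3": "disease B"}
query = {"seizure", "ataxia"}

ranking = recommend_diseases(query, patients, diagnoses, n_phenotypes=20, k=2)
print(ranking[0][0])
```

In this sketch, phenotypes drawn from EMRs and from literature would simply be merged into each set before the similarity is computed. That is only one of many possible fusion strategies, and the abstract does not say which ones the study used. Note also that the log likelihood ratio measures the strength of association in either direction, so a production system would need to handle large patient sets that have no overlap.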
