Study on the semi-supervised learning-based patient similarity from heterogeneous electronic medical records

We proposed a new learning-based patient similarity measure for heterogeneous electronic medical record (EMR) data. We first calculated feature-level similarities according to each feature's attribute type. A domain expert then provided patient similarity scores for 30 randomly selected patients. These scores, together with the feature-level similarities for the same 30 patients, formed the labeled sample set from which a semi-supervised learning algorithm learned the patient-level similarities for all patients. We then used the k-nearest neighbor (kNN) classifier to predict four liver conditions and compared predictive performance in four different situations. We also compared the personalized kNN models with other machine learning models. Predictive performance was assessed by the area under the receiver operating characteristic curve (AUC), F1-score, and cross-entropy (CE) loss. As the size of the random training sample increased, the kNN models that used the learned patient similarity to select nearest neighbors consistently outperformed those that used the Euclidean distance (all P values < 0.001). The kNN models that used the learned patient similarity to identify the top k nearest neighbors from the random training samples also reached a better best performance than those using the Euclidean distance (AUC: 0.95 vs. 0.89, F1-score: 0.84 vs. 0.67, and CE loss: 1.22 vs. 1.82). As the size of the similar training sample increased, where that sample consisted of the most similar samples as determined by the learned patient similarity, the performance of the kNN models using the simple Euclidean distance to select nearest neighbors degraded gradually. When the roles of the Euclidean distance and the learned patient similarity in selecting the nearest neighbors and the similar training samples were exchanged, the performance of the kNN models gradually increased. These two kinds of kNN models had the same best performance: AUC 0.95, F1-score 0.84, and CE loss 1.22. Among the four reference models, the highest AUC and F1-score were 0.94 and 0.80, respectively, both lower than those of the simple and similarity-based kNN models. This learning-based method offers a way to measure similarity on heterogeneous EMR data and supports the secondary use of EMR data.
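To make the pipeline described above concrete, the following is a minimal Python sketch of a kNN classifier whose neighbors are chosen by a weighted patient similarity rather than by Euclidean distance. It is an illustration under stated assumptions, not the authors' implementation: the per-type similarity formulas, the fixed `weights` (which in the study would come from the semi-supervised learning step), the function names, and the toy patients and labels are all hypothetical.

```python
from collections import Counter
from math import sqrt


def feature_similarities(a, b, kinds, ranges):
    """Per-feature similarity in [0, 1], chosen by feature type (assumed formulas)."""
    sims = []
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "numeric":
            # Range-normalized closeness; rng is the assumed value range.
            sims.append(1.0 - abs(x - y) / rng)
        else:
            # Categorical feature: exact match or no match.
            sims.append(1.0 if x == y else 0.0)
    return sims


def patient_similarity(a, b, kinds, ranges, weights):
    """Weighted average of feature-level similarities.

    The weights are placeholders; the study learns patient-level
    similarity from expert-labeled pairs instead of fixing them.
    """
    sims = feature_similarities(a, b, kinds, ranges)
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def neg_euclidean(a, b):
    """Baseline score: negated Euclidean distance (higher means closer)."""
    return -sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, patients, labels, k, score):
    """Majority vote among the k patients with the highest score."""
    ranked = sorted(range(len(patients)),
                    key=lambda i: score(query, patients[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (age in years, binary categorical code).
KINDS = ["numeric", "categorical"]
RANGES = [50, None]      # assumed age range; unused for the categorical feature
WEIGHTS = [1.0, 1.0]     # placeholder weights, not learned values

patients = [(30, 0), (32, 0), (60, 1), (65, 1), (58, 1)]
labels = ["condition_a", "condition_a",
          "condition_b", "condition_b", "condition_b"]


def learned_score(a, b):
    return patient_similarity(a, b, KINDS, RANGES, WEIGHTS)


query = (59, 1)
print(knn_predict(query, patients, labels, 3, learned_score))
print(knn_predict(query, patients, labels, 3, neg_euclidean))
```

Passing the scoring function as a parameter keeps the neighbor-selection rule separate from the voting step, so the similarity-based and Euclidean variants differ only in the `score` argument.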
