Machine Learning Methods for Disease Prediction with Claims Data

One of the primary challenges of healthcare delivery is aggregating disparate, asynchronous data sources into meaningful indicators of individual health. We combine natural language word embedding and network modeling techniques to learn meaningful representations of medical concepts by using the weighted network adjacency matrix in the GloVe algorithm, an approach we call Code2Vec. We demonstrate that using our learned embeddings improves neural network performance for disease prediction. However, we also demonstrate that popular deep learning models for disease prediction are not meaningfully better than simpler, more interpretable classifiers such as XGBoost. Additionally, our work adds to the current literature by providing a comprehensive survey of various machine learning algorithms on disease prediction tasks.
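
As a rough illustration of the embedding idea, the sketch below fits the standard GloVe weighted least-squares objective to a small weighted adjacency matrix of medical codes. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the adjacency matrix simply takes the place of the word co-occurrence matrix, it uses plain full-batch gradient descent rather than the AdaGrad optimizer of the original GloVe paper, and the function names, hyperparameters, and toy matrix are hypothetical.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): down-weights rare pairs, caps at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def train_glove_on_adjacency(adj, dim=8, epochs=200, lr=0.05, seed=0):
    """Fit GloVe-style embeddings to a weighted adjacency matrix.

    Assumption: adj[i, j] > 0 is the edge weight between codes i and j and
    plays the role of the co-occurrence count X_ij in the GloVe objective
        J = sum_{ij} f(X_ij) * (w_i . c_j + b_i + b'_j - log X_ij)^2.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]

    # Only nonzero entries contribute to the objective.
    rows, cols = np.nonzero(adj)
    x = adj[rows, cols].astype(float)
    log_x = np.log(x)
    f = glove_weight(x)

    W = rng.normal(scale=0.1, size=(n, dim))  # "word" vectors
    C = rng.normal(scale=0.1, size=(n, dim))  # "context" vectors
    bw = np.zeros(n)
    bc = np.zeros(n)

    losses = []
    for _ in range(epochs):
        # Residual of the log-bilinear model on every observed edge.
        err = np.sum(W[rows] * C[cols], axis=1) + bw[rows] + bc[cols] - log_x
        losses.append(float(np.sum(f * err ** 2)))

        # Gradients of J, accumulated per node.
        g = 2.0 * f * err
        grad_W = np.zeros_like(W)
        grad_C = np.zeros_like(C)
        np.add.at(grad_W, rows, g[:, None] * C[cols])
        np.add.at(grad_C, cols, g[:, None] * W[rows])
        grad_bw = np.bincount(rows, weights=g, minlength=n)
        grad_bc = np.bincount(cols, weights=g, minlength=n)

        # Plain gradient descent step (the original GloVe uses AdaGrad).
        W -= lr * grad_W
        C -= lr * grad_C
        bw -= lr * grad_bw
        bc -= lr * grad_bc

    # As in GloVe, sum the two vector sets to get the final embedding.
    return W + C, losses


# Toy symmetric network over four hypothetical medical codes.
toy_adj = np.array(
    [
        [0, 5, 1, 0],
        [5, 0, 2, 0],
        [1, 2, 0, 4],
        [0, 0, 4, 0],
    ],
    dtype=float,
)
toy_embeddings, toy_losses = train_glove_on_adjacency(toy_adj, dim=3, epochs=100)
```

Each row of `toy_embeddings` is a dense vector for one code, which could then be fed to a downstream classifier. How the edge weights are built from claims data is not specified in the abstract, so the toy matrix here is purely illustrative.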
