Embedding Complexity In the Data Representation Instead of In the Model: A Case Study Using Heterogeneous Medical Data

Electronic Health Records have become popular sources of data for secondary research, but their use is hampered by the amount of effort it takes to overcome the sparsity, irregularity, and noise that they contain. Modern learning architectures can remove the need for expert-driven feature engineering, but not the need for expert-driven preprocessing to abstract away the inherent messiness of clinical data. This preprocessing effort is often the dominant component of a typical clinical prediction project. In this work we propose using semantic embedding methods to directly couple the raw, messy clinical data to downstream learning architectures with truly minimal preprocessing. We examine this step from the perspective of capturing and encoding complex data dependencies in the data representation instead of in the model, which has the nice benefit of allowing downstream processing to be done with fast, lightweight, and simple models accessible to researchers without machine learning expertise. We demonstrate with three typical clinical prediction tasks that the highly compressed, embedded data representations capture a large amount of useful complexity, although in some cases the compression is not completely lossless.

[1]  Fei Wang,et al.  Deep learning for healthcare: review, opportunities and challenges , 2018, Briefings Bioinform..

[2]  Jeffrey Dean,et al.  Scalable and accurate deep learning with electronic health records , 2018, npj Digital Medicine.

[3]  Parisa Rashidi,et al.  Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis , 2017, IEEE Journal of Biomedical and Health Informatics.

[4]  The amazing power of word vectors , 2018 .

[5]  Yuan Luo,et al.  Recurrent Neural Networks for Classifying Relations in Clinical Notes , 2017, AMIA.

[6]  N. Cox,et al.  Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record , 2017, PloS one.

[7]  Anne E Carpenter,et al.  Opportunities and obstacles for deep learning in biology and medicine , 2017, bioRxiv.

[8]  Truyen Tran,et al.  Predicting healthcare trajectories from medical records: A deep learning approach , 2017, J. Biomed. Informatics.

[9]  Thomas A. Lasko,et al.  Predicting Medications from Diagnostic Codes with Recurrent Neural Networks , 2016, ICLR.

[10]  Jimeng Sun,et al.  Using recurrent neural network models for early detection of heart failure onset , 2016, J. Am. Medical Informatics Assoc..

[11]  Svetha Venkatesh,et al.  $\mathtt {Deepr}$: A Convolutional Net for Medical Records , 2016, IEEE Journal of Biomedical and Health Informatics.

[12]  Nilmini Wickramasinghe,et al.  Deepr: A Convolutional Net for Medical Records , 2016, ArXiv.

[13]  Li Li,et al.  Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records , 2016, Scientific Reports.

[14]  Li Li,et al.  Deep Learning to Predict Patient Future Diseases from the Electronic Health Records , 2016, ECIR.

[15]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[16]  Jimeng Sun,et al.  Multi-layer Representation Learning for Medical Concepts , 2016, KDD.

[17]  Jimeng Sun,et al.  Medical Concept Representation Learning from Electronic Health Records and its Application on Heart Failure Prediction , 2016, ArXiv.

[18]  Svetha Venkatesh,et al.  DeepCare: A Deep Dynamic Memory Model for Predictive Medicine , 2016, PAKDD.

[19]  Walter F. Stewart,et al.  Doctor AI: Predicting Clinical Events via Recurrent Neural Networks , 2015, MLHC.

[20]  Adler J. Perotte,et al.  Learning probabilistic phenotypes from heterogeneous EHR data , 2015, J. Biomed. Informatics.

[21]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[22]  D. Roden,et al.  Biobanks and Electronic Medical Records: Enabling Cost-Effective Research , 2014, Science Translational Medicine.

[23]  Jimeng Sun,et al.  Predicting changes in hypertension control using electronic health records from a chronic disease management program , 2014, J. Am. Medical Informatics Assoc..

[24]  Thomas A. Lasko,et al.  Efficient Inference of Gaussian-Process-Modulated Renewal Processes with Application to Medical Event Data , 2014, UAI.

[25]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[26]  T. Lasko,et al.  Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data , 2013, PloS one.

[27]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[28]  Chunhua Weng,et al.  Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research , 2013, J. Am. Medical Informatics Assoc..

[29]  Hua Xu,et al.  Portability of an algorithm to identify rheumatoid arthritis in electronic health records , 2012, J. Am. Medical Informatics Assoc..

[30]  Aapo Hyvärinen,et al.  Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics , 2012, J. Mach. Learn. Res..

[31]  I. Kohane Using electronic health records to drive discovery in disease genomics , 2011, Nature Reviews Genetics.

[32]  D. Roden,et al.  The Emerging Role of Electronic Medical Records in Pharmacogenomics , 2011, Clinical pharmacology and therapeutics.

[33]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[34]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[35]  Marylyn D. Ritchie,et al.  PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations , 2010, Bioinform..

[36]  I. Kohane,et al.  Electronic medical records for discovery research in rheumatoid arthritis , 2010, Arthritis care & research.

[37]  Geoffrey E. Hinton,et al.  A Scalable Hierarchical Distributed Language Model , 2008, NIPS.

[38]  D. Roden,et al.  Development of a Large‐Scale De‐Identified DNA Biobank to Enable Personalized Medicine , 2008, Clinical pharmacology and therapeutics.

[39]  Charles Safran,et al.  Toward a national framework for the secondary use of health data: an American Medical Informatics Association White Paper. , 2007, Journal of the American Medical Informatics Association : JAMIA.

[40]  John F. Hurdle,et al.  Measuring diagnoses: ICD code accuracy. , 2005, Health services research.

[41]  Yoshua Bengio,et al.  Hierarchical Probabilistic Neural Network Language Model , 2005, AISTATS.