Longitudinal patient stratification of electronic health records with flexible adjustment for clinical outcomes

The growing availability of longitudinal electronic health record (EHR) data is improving our understanding of diseases and enabling the discovery of novel phenotypes. Most clustering algorithms focus only on patient trajectories, yet patients with similar trajectories may have different outcomes. Finding subgroups of patients with different trajectories and outcomes can guide future drug development and improve recruitment to clinical trials. We develop a recurrent neural network autoencoder that clusters EHR data using reconstruction, outcome, and clustering losses, which can be weighted to find different types of patient clusters. We show that our model discovers known clusters arising from both data biases and outcome differences, outperforming baseline models. We demonstrate the model's performance on 29,229 diabetes patients, showing that it finds clusters of patients with both different trajectories and different outcomes, which can be used to aid clinical decision making.
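To make the weighting idea concrete, the following is a minimal sketch of how three loss terms might be combined into one training objective. The abstract does not specify the form of each loss, so everything here is an assumption chosen for illustration: mean squared error for reconstruction, binary cross-entropy for the outcome, a KL divergence between soft cluster assignments for clustering, and the hypothetical names `combined_loss`, `w_recon`, `w_outcome`, and `w_cluster`.

```python
import math


def mse(x, x_hat):
    """Mean squared error between an input sequence and its reconstruction (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy between outcome labels y and predicted probabilities p (assumed form)."""
    total = 0.0
    for t, q in zip(y, p):
        q = min(max(q, eps), 1.0 - eps)  # clip to avoid log(0)
        total += t * math.log(q) + (1 - t) * math.log(1 - q)
    return -total / len(y)


def kl_divergence(target, assigned, eps=1e-12):
    """KL(target || assigned) for one patient's soft cluster assignment (assumed form)."""
    return sum(
        t * math.log(t / max(a, eps))
        for t, a in zip(target, assigned)
        if t > 0
    )


def combined_loss(l_recon, l_outcome, l_cluster,
                  w_recon=1.0, w_outcome=1.0, w_cluster=1.0):
    """Weighted sum of the three loss terms; the weights are hypothetical hyperparameters."""
    return w_recon * l_recon + w_outcome * l_outcome + w_cluster * l_cluster


# Toy values for a single patient, purely illustrative.
l_recon = mse([0.2, 0.4, 0.6], [0.25, 0.35, 0.7])
l_outcome = binary_cross_entropy([1], [0.8])
l_cluster = kl_divergence([0.9, 0.1], [0.7, 0.3])

# Emphasising trajectories versus emphasising outcomes.
trajectory_focused = combined_loss(l_recon, l_outcome, l_cluster, w_outcome=0.0)
outcome_focused = combined_loss(l_recon, l_outcome, l_cluster, w_outcome=5.0)
```

Under these assumptions, setting `w_outcome` to zero would leave an objective driven only by trajectories and cluster structure, while a larger `w_outcome` would push the learned representation toward separating patients by outcome.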
