Supervised multi-specialist topic model with applications on large-scale electronic health record data

Motivation: Electronic health record (EHR) data provide a new avenue for elucidating disease comorbidities and latent phenotypes for precision medicine. Fully exploiting this potential requires modelling a realistic generative process for the EHR data.

Materials and Methods: We present MixEHR-S, which jointly infers specialist-disease topics from EHR data. As the key contribution, we model specialist assignments and ICD-coded diagnoses as latent topics based on a patient's underlying disease topic mixture, within a novel, unified, supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S.

Results: We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) congenital heart disease (CHD) diagnostic prediction among 154,775 patients; (2) chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; and (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes, as a means of assessing disease exacerbation. In all three applications, the most predictive latent topics inferred by MixEHR-S included clinically meaningful ones, and MixEHR-S achieved higher target prediction accuracy than existing methods, providing opportunities to prioritize high-risk patients for healthcare services.

Availability and implementation: MixEHR-S source code and the scripts for the experiments are freely available at https://github.com/li-lab-mcgill/mixehrS
