Modeling healthcare data using multiple-channel latent Dirichlet allocation

Information and communications technologies have enabled healthcare institutions to accumulate large amounts of healthcare data, including diagnoses, medications, and additional contextual information such as patient demographics. To gain a better understanding of such data and to develop better data-driven clinical decision support systems, we propose a novel multiple-channel latent Dirichlet allocation (MCLDA) approach for modeling diagnoses, medications, and contextual information in healthcare data. The proposed MCLDA model assumes that a latent health status group structure is responsible for the observed co-occurrences among diagnoses, medications, and contextual information. We investigate the utility of MCLDA using a real-world research testbed of one million healthcare insurance claim records. Our empirical evaluation suggests that MCLDA can capture comorbidity structures and link them with the distribution of medications. Moreover, MCLDA can identify the pairing between diagnoses and medications in a record based on the assigned latent groups, and it can be employed to predict missing medications or diagnoses given partial records. The evaluation also shows that, in most cases, MCLDA outperforms alternative methods such as logistic regression and the k-nearest-neighbor (KNN) model on two prediction tasks: medication prediction and diagnosis prediction. Thus, MCLDA represents a promising approach to modeling healthcare data for clinical decision support.
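
The abstract does not spell out the generative process, so the following is only a minimal sketch of how a latent-group model over several channels could generate one record, assuming a standard LDA-style construction: each record draws a mixture over latent health status groups, and every item in every channel (diagnosis, medication, context) is drawn from the channel-specific distribution of a group sampled from that mixture. The function names, the toy distributions in `PHI`, and the fixed channel lengths are hypothetical illustrations, not the authors' specification or data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_record(alpha, phi, channel_lengths, rng):
    """Generate one record under the ASSUMED multi-channel process.

    alpha           -- Dirichlet prior over latent groups (length K)
    phi             -- phi[channel][group] is a distribution over that
                       channel's items (hypothetical toy values below)
    channel_lengths -- number of items to draw per channel
    Returns the record's group mixture and, per channel, a list of
    (latent group, item index) pairs.
    """
    theta = sample_dirichlet(alpha, rng)  # record-level group mixture
    record = {}
    for channel, length in channel_lengths.items():
        items = []
        for _ in range(length):
            z = sample_categorical(theta, rng)               # latent group
            item = sample_categorical(phi[channel][z], rng)  # observed item
            items.append((z, item))
        record[channel] = items
    return theta, record

# Toy, made-up distributions for K = 2 latent groups (illustration only).
PHI = {
    "diagnosis":  [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],
    "medication": [[0.9, 0.1], [0.1, 0.9]],
    "context":    [[0.5, 0.5], [0.5, 0.5]],
}

if __name__ == "__main__":
    rng = random.Random(0)
    lengths = {"diagnosis": 3, "medication": 2, "context": 1}
    theta, record = generate_record([1.0, 1.0], PHI, lengths, rng)
    print(theta)
    print(record)
```

Because diagnoses and medications in the same record share one group mixture, items assigned to the same latent group tend to co-occur, which is the intuition behind pairing diagnoses with medications through their assigned groups. Inference, for example by Gibbs sampling, would run this process in reverse and is not shown here.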
