Modeling heterogeneous clinical sequence data in semantic space for adverse drug event detection

The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach not only reduces sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. We report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under the ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets, each of which involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.
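The contrast between the bag-of-items representation and the semantic-space representation of a care episode can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the item codes are examples, the item vectors are random stand-ins for vectors that would be learned from co-occurrence information, the dimensionality is arbitrary, and summing item vectors is just one plausible way to aggregate them into an episode representation (the abstract does not state which aggregation is used).

```python
import numpy as np

def bag_of_items(episode, vocab):
    """Count vector: one dimension per unique item (sparse, high-dimensional)."""
    index = {item: i for i, item in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for item in episode:
        counts[index[item]] += 1
    return counts

def episode_vector(episode, item_vectors):
    """Dense vector: sum of the items' vectors (aggregation choice is an assumption)."""
    return np.sum([item_vectors[item] for item in episode], axis=0)

# Hypothetical vocabulary of clinical event codes (illustrative only).
vocab = ["I10", "E11", "C09AA02", "A10BA02"]

# Random placeholders standing in for learned semantic vectors.
rng = np.random.default_rng(0)
dim = 3  # arbitrary; real semantic spaces use far more dimensions
item_vectors = {item: rng.normal(size=dim) for item in vocab}

# A care episode represented as a sequence of events.
episode = ["I10", "C09AA02", "I10"]

sparse_repr = bag_of_items(episode, vocab)           # length == len(vocab)
dense_repr = episode_vector(episode, item_vectors)   # length == dim
```

In the bag-of-items form, the feature count grows with the number of unique items and two related codes share nothing; in the dense form, the feature count is fixed by the chosen dimensionality, and items with similar vectors contribute similarly. Either array could then be passed as one row of the feature matrix to a learning algorithm such as a random forest.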
