Extracting Smoking Status from Electronic Health Records Using NLP and Deep Learning.

Half a million people die from smoking-related causes in the United States every year. Identifying tobacco-dependent individuals is essential for implementing preventive measures. In this study, we investigate how effectively deep learning models extract patients' smoking status from clinical progress notes. We built a Natural Language Processing (NLP) pipeline that cleans the progress notes before they are processed by three deep neural networks: a CNN, a unidirectional LSTM, and a bidirectional LSTM. Each of these models was trained with either a pre-trained or a post-trained word embedding layer. Three traditional machine learning models were also trained as baselines for comparison with the neural networks. Each model produced both binary and multi-class classifications. Our results showed that the CNN model with a pre-trained embedding layer performed best for both binary and multi-class classification.
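To illustrate the kind of cleaning step such a pipeline performs before notes reach an embedding layer, here is a minimal sketch in Python. It is not the authors' implementation: the cleaning rules, the function names (`clean_note`, `build_vocab`, `encode`), the `<pad>`/`<unk>` tokens, and the example notes are all assumptions made for illustration.

```python
import re
from collections import Counter

def clean_note(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace.
    These rules are an assumed example, not the cleaning rules used in the study."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(notes, min_count=1):
    """Map each token to an integer id; ids 0 and 1 are reserved for padding and unknowns."""
    counts = Counter(tok for note in notes for tok in note.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(note, vocab, max_len):
    """Convert a cleaned note to a fixed-length list of token ids (truncate, then pad)."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in note.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Hypothetical progress-note snippets, invented for this example.
notes = ["Pt. denies tobacco use.", "Current smoker, 1 pack/day."]
cleaned = [clean_note(n) for n in notes]
vocab = build_vocab(cleaned)
encoded = [encode(n, vocab, max_len=6) for n in cleaned]
```

The fixed-length integer sequences in `encoded` are the usual input shape for a word embedding layer, whether that layer is initialized from pre-trained vectors or learned during training.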
