A preliminary study on automatic identification of patient smoking status in unstructured electronic health records

Identifying the smoking status of patients is vital for assessing their risk of disease. With the rapid adoption of electronic health records (EHRs), patient information is scattered across various systems in the form of structured and unstructured data. In this study, we aimed to develop a hybrid system that uses rule-based, unsupervised and supervised machine learning techniques to automatically identify the smoking status of patients in unstructured EHRs. In addition to traditional features, we used per-document topic model distribution weights as features in our system. We also discuss the performance of our hybrid system with different feature sets. Our preliminary results demonstrated that combining per-document topic model distribution weights with traditional features improves the overall performance of the system.
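The feature combination described above can be illustrated with a minimal sketch. It is not the system used in the study: the keyword pattern, the vocabulary, the function names and the topic weights are all hypothetical, and the per-document topic weights are assumed to come from a topic model fitted elsewhere. The sketch only shows a rule-based check followed by concatenation of traditional term-count features with topic weights into one vector that a supervised classifier could consume.

```python
import re
from collections import Counter

# Hypothetical keyword rule; the abstract does not list the actual rules.
SMOKING_PATTERN = re.compile(
    r"\b(smok\w*|tobacco|cigarette\w*|nicotine)\b", re.IGNORECASE
)


def has_smoking_mention(text):
    """Rule-based stage: flag documents that mention smoking at all."""
    return SMOKING_PATTERN.search(text) is not None


def bag_of_words(text, vocabulary):
    """Traditional features: term counts over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[term]) for term in vocabulary]


def combine_features(text, vocabulary, topic_weights):
    """Concatenate term counts with per-document topic weights.

    topic_weights is assumed to be the document's topic distribution,
    produced by a separately fitted topic model (not shown here).
    """
    return bag_of_words(text, vocabulary) + list(topic_weights)


if __name__ == "__main__":
    vocabulary = ["smoker", "quit", "denies"]       # illustrative only
    note = "Patient is a former smoker who quit in 2010."
    topic_weights = [0.7, 0.3]                      # illustrative only
    if has_smoking_mention(note):
        print(combine_features(note, vocabulary, topic_weights))
```

In a fuller pipeline, documents that pass the rule-based check would have their combined vectors passed to a supervised classifier; comparing classifiers trained with and without the topic-weight columns is one way to reproduce the kind of feature-set comparison the abstract mentions.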
