Combining Structured and Free Textual Data of Diabetic Patients' Smoking Status

The main goal of this research is to identify and extract risk factors for Diabetes Mellitus. The data source for our experiments is 8 million outpatient records from the Bulgarian Diabetes Registry, submitted to the Bulgarian Health Insurance Fund by general practitioners and all kinds of professionals during 2014. In this paper we report our work on automatic identification of patients’ smoking status. The experiments are performed on the free-text sections of a randomly extracted subset of the registry's outpatient records. Although no rich semantic resources for Bulgarian exist, we were able to enrich our model with semantic features based on categorical vocabularies. In addition to the automatically labeled records, we use the records from the Diabetes Registry that contain diagnoses related to tobacco usage. Finally, a combined result from structured information (ICD-10 codes) and extracted data about the smoking status is associated with each patient. The reported accuracy of the best model is comparable to the highest results reported at the i2b2 Challenge 2006. The method is ready to be validated on big data after minor improvements.
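
The final combination step, in which structured ICD-10 information and the smoking status extracted from text are merged into one value per patient, might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the label strings, the choice of tobacco-related code prefixes (F17 and Z72.0), and the rule that a structured tobacco diagnosis takes precedence over the text label are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Optional

# Assumed tobacco-related ICD-10 prefixes, written without the dot:
# F17 (mental and behavioural disorders due to use of tobacco) and
# Z72.0 (tobacco use). Which codes the registry records actually
# contain is an assumption of this sketch.
TOBACCO_ICD10_PREFIXES = ("F17", "Z720")


def has_tobacco_code(icd10_codes: Iterable[str]) -> bool:
    """Return True if any diagnosis code is tobacco-related."""
    for code in icd10_codes:
        normalized = code.strip().upper().replace(".", "")
        if normalized.startswith(TOBACCO_ICD10_PREFIXES):
            return True
    return False


def combine_smoking_status(
    icd10_codes: Iterable[str], text_label: Optional[str]
) -> str:
    """Merge structured and text-derived evidence into one status.

    Hypothetical rule: a tobacco-related diagnosis marks the patient
    as a smoker; otherwise the text classifier's label is used; with
    no evidence at all the status is "unknown".
    """
    if has_tobacco_code(icd10_codes):
        return "smoker"
    if text_label is not None:
        return text_label
    return "unknown"


# Example: a patient with a diabetes code and a tobacco-related code.
status = combine_smoking_status(["E11.9", "F17.2"], "non-smoker")
```

Letting the structured code win is only one possible design choice; a real system might instead keep both sources and flag disagreements for review.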
