Use of Semantic Features to Classify Patient Smoking Status

The recent i2b2 NLP Challenge smoking classification task offers a rare chance to compare different natural language processing techniques on actual clinical data. We compare the performance of a classifier that relies on semantic features generated by an unmodified version of MedLEE, a clinical NLP engine, with that of a classifier using lexical features. We also compare supervised classifiers with rule-based symbolic classifiers. Our baseline supervised classifier with lexical features yields a microaveraged F-measure of 0.81. Our rule-based classifier using MedLEE semantic features performs slightly better, with an F-measure of 0.83. Our supervised classifier trained with semantic MedLEE features is competitive with the top-performing smoking classifier in the i2b2 NLP Challenge, with microaveraged precision of 0.90, recall of 0.89, and F-measure of 0.89.
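For readers unfamiliar with microaveraging, the reported scores can be read as follows: true positives, false positives, and false negatives are pooled across all smoking-status classes before precision, recall, and F-measure are computed. The sketch below is a minimal illustration of that calculation, not the evaluation code used in the study. The function name `micro_prf`, the label strings, and the toy data are all hypothetical, and the convention that `None` means "no label assigned" is an assumption made here so that precision and recall can differ.

```python
from collections import Counter

def micro_prf(gold, predicted):
    """Return micro-averaged (precision, recall, F-measure).

    gold and predicted are parallel lists of class labels. A predicted
    value of None means the classifier assigned no label (an assumption
    of this sketch, not something stated in the abstract).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if p == g:
            tp[g] += 1          # correct label for this class
        else:
            if p is not None:
                fp[p] += 1      # wrong label counts against the predicted class
            fn[g] += 1          # and the gold class was missed
    # Micro-averaging: sum the counts over classes, then compute the ratios.
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    precision = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    recall = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Toy data, invented purely for illustration.
gold = ["CURRENT", "PAST", "NON", "UNKNOWN", "UNKNOWN", "NON"]
pred = ["CURRENT", "CURRENT", "NON", "UNKNOWN", None, "NON"]
p, r, f = micro_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the counts are pooled, frequent classes dominate a microaveraged score. When every document receives exactly one label, microaveraged precision and recall coincide with accuracy.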
