Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries

OBJECTIVE A system that translates narrative text in the medical domain into a structured representation is in great demand. Our system performs three sub-tasks: concept extraction, assertion classification, and relation identification.

DESIGN The overall system consists of five steps: (1) pre-processing sentences, (2) marking noun phrases (NPs) and adjective phrases (APs), (3) extracting concepts, using a dosage-unit dictionary to switch dynamically between two models based on Conditional Random Fields (CRF), (4) classifying assertions by a vote among five classifiers, and (5) identifying relations using normalized sentences with a set of effective discriminating features.

MEASUREMENTS Macro-averaged and micro-averaged precision, recall, and F-measure were used to evaluate results.

RESULTS Performance is competitive with state-of-the-art systems, with micro-averaged F-measures of 0.8489 for concept extraction, 0.9392 for assertion classification, and 0.7326 for relation identification.

CONCLUSIONS The system exploits an array of common features and achieves state-of-the-art performance. Prudent feature engineering sets the foundation of our system. In concept extraction, we demonstrated that switching between models, one of which is designed specifically for telegraphic sentences, significantly improved extraction of the treatment concept. In assertion classification, a set of features derived from a rule-based classifier proved effective for classes such as conditional and possible, which would suffer from data scarcity under conventional machine-learning methods. In relation identification, we use a two-stage architecture, the second stage of which applies pairwise classifiers to the possible candidate classes. This architecture significantly improves performance.
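The difference between the macro-averaged and micro-averaged measures named under MEASUREMENTS can be illustrated with a minimal Python sketch. It is not the evaluation code used for the reported results: the class names, the counts, and the function names `prf`, `micro_average`, and `macro_average` are all hypothetical, and the macro F-measure is taken here as the mean of the per-class F-measures, which is only one common convention.

```python
from typing import Dict, Tuple

# Per-class counts: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_average(counts: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Pool the counts over all classes, then compute one P/R/F."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


def macro_average(counts: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Compute P/R/F per class, then take the unweighted mean of each.

    Assumption: macro F is the mean of per-class F-measures.
    """
    scores = [prf(*c) for c in counts.values()]
    n = len(scores)
    p = sum(s[0] for s in scores) / n
    r = sum(s[1] for s in scores) / n
    f = sum(s[2] for s in scores) / n
    return p, r, f


# Made-up counts for two hypothetical classes, one frequent and one rare.
example_counts = {"A": (8, 2, 2), "B": (1, 1, 3)}
micro = micro_average(example_counts)
macro = macro_average(example_counts)
```

Because micro-averaging pools the counts, frequent classes dominate the score, whereas macro-averaging weights every class equally, so weak performance on a rare class lowers the macro score more visibly.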
