Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications

We aim to build and evaluate an open-source natural language processing system for extracting information from clinical free text in electronic medical records. We describe and evaluate that system, the clinical Text Analysis and Knowledge Extraction System (cTAKES), released open-source at http://www.ohnlp.org. cTAKES builds on two existing open-source technologies: the Unstructured Information Management Architecture framework and the OpenNLP natural language processing toolkit. Its components, trained specifically for the clinical domain, create rich linguistic and semantic annotations. Performance of individual components: sentence boundary detector accuracy=0.949; tokenizer accuracy=0.949; part-of-speech tagger accuracy=0.936; shallow parser F-score=0.924; named entity recognizer and system-level evaluation F-score=0.715 for exact spans and 0.824 for overlapping spans. Accuracy for the concept mapping, negation, and status attributes was 0.957, 0.943, and 0.859 for exact spans, and 0.580, 0.939, and 0.839 for overlapping spans, respectively. Overall performance is discussed against five applications. The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free text.
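
The distinction between exact and overlapping spans in the named entity results can be illustrated with a minimal sketch. The abstract does not define the matching criteria, so the code below rests on assumptions: spans are (start, end) character offsets with an exclusive end, an exact match requires identical offsets, and an overlapping match requires at least one shared character. The names `overlaps` and `span_f_score` are hypothetical and are not part of cTAKES.

```python
from typing import List, Tuple

# Assumption: a span is a (start, end) pair of character offsets, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def span_f_score(gold: List[Span], predicted: List[Span], exact: bool = True) -> float:
    """Balanced F-score of predicted spans against gold spans.

    exact=True counts a match only when offsets are identical;
    exact=False counts any overlap as a match (assumed definition).
    """
    if exact:
        gold_set, pred_set = set(gold), set(predicted)
        matched_pred = sum(1 for p in predicted if p in gold_set)
        matched_gold = sum(1 for g in gold if g in pred_set)
    else:
        matched_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
        matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only (not from the paper's corpus).
gold = [(0, 5), (10, 20), (30, 40)]
predicted = [(0, 5), (12, 20), (50, 60)]

exact_f = span_f_score(gold, predicted, exact=True)
overlap_f = span_f_score(gold, predicted, exact=False)
print(f"exact F-score:   {exact_f:.3f}")
print(f"overlap F-score: {overlap_f:.3f}")
```

In this toy case the prediction (12, 20) misses the gold boundary at 10, so it counts only under overlapping matching. A boundary-tolerant criterion therefore always yields an F-score at least as high as the exact one, which is the same direction as the 0.715 versus 0.824 difference reported above.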
