Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study

Automatic recognition of clinical entities in the narrative text of health records is useful for constructing applications for documentation of patient care, as well as for secondary use in the form of medical knowledge extraction. There are a number of named entity recognition studies on English clinical text, but less work has been carried out on clinical text in other languages. This study was performed on Swedish health records, and focused on four entities that are highly relevant for constructing a patient overview and for medical hypothesis generation: Disorder, Finding, Pharmaceutical Drug and Body Structure. The study had two aims: to explore how well named entity recognition methods previously applied to English clinical text perform on similar texts written in Swedish; and to evaluate whether it is meaningful to divide the more general category Medical Problem, which has been used in a number of previous studies, into the two more granular entities, Disorder and Finding. Clinical notes from a Swedish internal medicine emergency unit were annotated for the four selected entity categories, and the inter-annotator agreement between two pairs of annotators was measured, resulting in an average F-score of 0.79 for Disorder, 0.66 for Finding, 0.90 for Pharmaceutical Drug and 0.80 for Body Structure. A subset of the developed corpus was thereafter used to find suitable features for training a conditional random fields model. Finally, a new model was trained on this subset, using the best features and settings, and its ability to generalise to held-out data was evaluated. This final model obtained an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure and 0.78 for the combined category Disorder+Finding.
The obtained results, which are in line with or slightly lower than those for similar studies on English clinical text, many of them conducted using a larger training data set, show that the approaches used for English are also suitable for Swedish clinical text. However, a small proportion of the errors made by the model are less likely to occur in English text, showing that results might be improved by further tailoring the system to clinical Swedish. The entity recognition results for the individual entities Disorder and Finding show that it is meaningful to separate the general category Medical Problem into these two more granular entity types, e.g. for knowledge mining of co-morbidity relations and disorder-finding relations.
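
The per-entity F-scores reported above, both for inter-annotator agreement and for model evaluation, can be illustrated with a minimal sketch. It assumes exact span matching and treats one annotation set as the reference; the abstract does not state the matching criterion, so this is an assumption, and the names `Entity`, `f_score` and `merge_labels` are hypothetical rather than taken from the study.

```python
from typing import NamedTuple, Set


class Entity(NamedTuple):
    """An annotated span: token offsets plus an entity label (hypothetical format)."""
    start: int
    end: int
    label: str


def f_score(reference: Set[Entity], candidate: Set[Entity], label: str) -> float:
    """Entity-level F-score for one label, counting only exact span matches (assumed)."""
    ref = {e for e in reference if e.label == label}
    cand = {e for e in candidate if e.label == label}
    matches = len(ref & cand)
    precision = matches / len(cand) if cand else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def merge_labels(entities: Set[Entity], labels: Set[str], merged: str) -> Set[Entity]:
    """Collapse several labels into one, e.g. Disorder and Finding into Disorder+Finding."""
    return {Entity(e.start, e.end, merged if e.label in labels else e.label)
            for e in entities}


# Toy data: the two annotators agree on every span but disagree on one label.
annotator_a = {Entity(0, 2, "Disorder"), Entity(5, 6, "Finding"), Entity(9, 10, "Disorder")}
annotator_b = {Entity(0, 2, "Disorder"), Entity(5, 6, "Disorder"), Entity(9, 10, "Disorder")}

disorder_f = f_score(annotator_a, annotator_b, "Disorder")
finding_f = f_score(annotator_a, annotator_b, "Finding")

# Scoring the combined category removes Disorder/Finding confusions from the error count.
both = {"Disorder", "Finding"}
combined_f = f_score(
    merge_labels(annotator_a, both, "Disorder+Finding"),
    merge_labels(annotator_b, both, "Disorder+Finding"),
    "Disorder+Finding",
)
```

In this toy example the single label disagreement lowers the scores of both granular entities, while the combined category is unaffected. In the study itself the combined category scored 0.78, between the separate Disorder and Finding scores, so this sketch shows only how the three kinds of score are computed, not how they relate on real data.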
