Combining an Expert-Based Medical Entity Recognizer to a Machine-Learning System: Methods and a Case Study

Medical entity recognition is currently performed mostly by data-driven methods based on supervised machine learning. Expert-based systems, in which linguistic and domain expertise is provided directly to the system, are often combined with data-driven systems. We present a case study in which an existing expert-based medical entity recognition system, Ogmios, is combined with a data-driven system, Caramba, based on a linear-chain Conditional Random Field (CRF) classifier. Our case study specifically highlights the risk of overfitting incurred by an expert-based system. We observe that this overfitting prevents the combination of the two systems from improving precision, recall, or F-measure, and we analyze the underlying mechanisms through a post-hoc feature-level analysis. Wrapping the expert-based system alone as attributes input to a CRF classifier does boost its F-measure from 0.603 to 0.710, bringing it on par with the data-driven system. Whether this method generalizes remains to be investigated.
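
The idea of wrapping an expert-based system as attributes for a CRF can be illustrated with a minimal sketch. It assumes the expert system has already produced one BIO tag per token; the function names, the window size, and the example tags are hypothetical and do not reflect the actual interfaces of Ogmios or Caramba. Each token is described only by the expert tags in a small window around it, which is one plausible reading of "the expert-based system alone".

```python
from typing import Dict, List


def expert_only_features(expert_tags: List[str], i: int, window: int = 1) -> Dict[str, str]:
    """Describe token i using only the expert system's tags in a +/- window.

    Hypothetical helper: the feature names and the window size are
    illustrative assumptions, not the feature set used in the paper.
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(expert_tags):
            feats[f"expert_tag[{offset:+d}]"] = expert_tags[j]
        else:
            # Mark positions that fall outside the sentence.
            feats[f"expert_tag[{offset:+d}]"] = "<PAD>"
    return feats


def sentence_to_features(expert_tags: List[str], window: int = 1) -> List[Dict[str, str]]:
    """Turn one sentence of expert tags into a sequence of feature dicts."""
    return [expert_only_features(expert_tags, i, window) for i in range(len(expert_tags))]


# Illustrative input: expert tags for the tokens "Patient received aspirin daily".
# The tag inventory here is made up for the example.
example_tags = ["O", "O", "B-treatment", "O"]
example_features = sentence_to_features(example_tags)
```

A sequence of such feature dictionaries, paired with gold labels, is the kind of input that a linear-chain CRF trainer accepting per-token feature dictionaries could learn from; the CRF then decides how far to trust the expert tags rather than copying them directly.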
