Combining an Expert-Based Medical Entity Recognizer to a Machine-Learning System: Methods and a Case Study

Medical entity recognition is currently performed mostly by data-driven methods based on supervised machine learning. Expert-based systems, where linguistic and domain expertise are directly provided to the system, are often combined with data-driven systems. We present here a case study in which an existing expert-based medical entity recognition system, Ogmios, is combined with a data-driven system, Caramba, based on a linear-chain Conditional Random Field (CRF) classifier. Our case study specifically highlights the risk of overfitting incurred by an expert-based system. We observe that it prevents the combination of the two systems from obtaining improvements in precision, recall, or F-measure, and we analyze the underlying mechanisms through a post-hoc feature-level analysis. Wrapping the expert-based system alone as attributes input to a CRF classifier does boost its F-measure from 0.603 to 0.710, bringing it on par with the data-driven system. The generalization of this method remains to be further investigated.

T. Lavergne | Pierre Zweigenbaum | Cyril Grouin | S. Rosset | N. Grabar | Thierry Hamon
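
The abstract's idea of wrapping an expert-based system "as attributes input to a CRF classifier" can be pictured as follows. This is only a minimal sketch under stated assumptions, not the authors' feature set: the function names, the feature keys, the BIO tag scheme, and the example sentence are all hypothetical. Each token's attribute dictionary simply carries the expert system's tag for that token and its neighbours, alongside ordinary lexical attributes.

```python
from typing import Dict, List


def token_features(tokens: List[str], expert_tags: List[str], i: int) -> Dict[str, str]:
    """Build the attribute dict for token i (hypothetical feature names)."""
    feats = {
        # Ordinary lexical attributes a data-driven tagger might use.
        "word.lower": tokens[i].lower(),
        "suffix3": tokens[i][-3:],
        # The expert system's own decision, exposed as one more attribute.
        "expert.tag": expert_tags[i],
    }
    # Expert tags of the neighbouring tokens, with boundary markers.
    feats["expert.prev"] = expert_tags[i - 1] if i > 0 else "BOS"
    feats["expert.next"] = expert_tags[i + 1] if i < len(tokens) - 1 else "EOS"
    return feats


def sentence_features(tokens: List[str], expert_tags: List[str]) -> List[Dict[str, str]]:
    """Return one attribute dict per token; inputs must be aligned one-to-one."""
    if len(tokens) != len(expert_tags):
        raise ValueError("tokens and expert_tags must be aligned")
    return [token_features(tokens, expert_tags, i) for i in range(len(tokens))]


# Hypothetical sentence and expert-system output (BIO scheme assumed).
tokens = ["Patient", "denies", "chest", "pain"]
expert_tags = ["O", "O", "B-problem", "I-problem"]
features = sentence_features(tokens, expert_tags)
```

Attribute dictionaries of this kind could then be passed, together with gold labels, to any linear-chain CRF toolkit. During training the CRF would learn how much weight to give the expert tags, which is also where the overfitting risk mentioned in the abstract could arise if those tags fit the training data too closely.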
