Clinical Word Sense Disambiguation with Interactive Search and Classification

Resolving word ambiguity in clinical text is critical for many natural language processing applications. Effective word sense disambiguation (WSD) systems rely on training a machine-learning-based classifier on abundant, accurately annotated clinical text, which is costly and time-consuming to create. We describe a double-loop interactive machine learning process, named ReQ-ReC (ReQuery-ReClassify), and demonstrate its effectiveness on multiple evaluation corpora. Using ReQ-ReC, a human expert first uses her domain knowledge to add sense-specific contextual words to the queries in the ReQuery loops and searches for instances relevant to each sense. Then, in the ReClassify loops, the expert annotates only the most ambiguous instances found by the current WSD model. Even with machine-generated queries only, the framework is comparable with or faster than current active learning methods in building WSD models. The process can be further accelerated when human experts use their domain knowledge to guide the search process.
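The two loops described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`search`, `margin`, `most_ambiguous`), the keyword-overlap retrieval, and the use of the margin between the two most probable senses as the ambiguity measure are all assumptions chosen for illustration, and the sense probabilities are made-up numbers standing in for the output of a trained WSD classifier.

```python
from typing import Dict, List, Sequence, Set


def search(corpus: Dict[str, str], query_words: Set[str]) -> Set[str]:
    """ReQuery step (toy version): return ids of instances whose text
    contains at least one of the sense-specific context words."""
    return {
        doc_id
        for doc_id, text in corpus.items()
        if query_words & set(text.lower().split())
    }


def margin(probs: Sequence[float]) -> float:
    """Gap between the two most probable senses; a small gap means the
    current model is unsure which sense applies."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]


def most_ambiguous(sense_probs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """ReClassify step (toy version): pick the k instances with the
    smallest margin to send to the expert for annotation."""
    return sorted(sense_probs, key=lambda doc_id: margin(sense_probs[doc_id]))[:k]


# Hypothetical instances of the ambiguous abbreviation "mi".
corpus = {
    "a": "patient had an mi with chest pain",
    "b": "mi noted on echo of the valve",
    "c": "follow up in clinic next week",
}

# The expert supplies context words for the senses of interest.
retrieved = search(corpus, {"chest", "echo"})

# Assumed model output: probability of each of two senses per instance.
sense_probs = {"a": [0.9, 0.1], "b": [0.55, 0.45]}
to_annotate = most_ambiguous(
    {doc_id: sense_probs[doc_id] for doc_id in retrieved}, k=1
)
```

In a full process, the labels collected for `to_annotate` would be used to retrain the classifier, and the retrained model or the expert would propose new query words for the next round of search.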
