Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation

BackgroundEvaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD.MethodsIn our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH heading to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS CUI linked to the MeSH heading. Each instance has been assigned a UMLS Concept Unique Identifier (CUI). We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set.ResultsThe resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 which are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE.We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to that of the results previously obtained by these algorithms on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods.ConclusionsThe MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.

[1]  George A. Miller,et al.  Using Corpus Statistics and WordNet Relations for Sense Identification , 1998, CL.

[2]  Hongfang Liu,et al.  Research Paper: Automatic Resolution of Ambiguous Terms Based on Machine Learning and Conceptual Relations in the UMLS , 2002, J. Am. Medical Informatics Assoc..

[3]  Goran Nenadic,et al.  Building and Evaluating Resources for Biomedical Text Mining , 2008 .

[4]  Ted Pedersen,et al.  Supervised and knowledge-based methods for disambiguating terms in biomedical text using the umls and metamap , 2009 .

[5]  Bridget T. McInnes An Unsupervised Vector Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline , 2008, ACL.

[6]  Thomas C. Rindflesch,et al.  Effects of information and machine learning algorithms on word sense disambiguation with small datasets , 2005, Int. J. Medical Informatics.

[7]  Halil Kilicoglu,et al.  Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: Preliminary experiment , 2006, J. Assoc. Inf. Sci. Technol..

[8]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[9]  KilicogluHalil,et al.  Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing , 2006 .

[10]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[11]  Marc Weeber,et al.  Developing a test collection for biomedical word sense disambiguation , 2001, AMIA.

[12]  Ted Pedersen The effect of different context representations on word sense discrimination in biomedical texts , 2010, IHI.

[13]  Dietrich Rebholz-Schuhmann,et al.  Assessment of disease named entity recognition on a corpus of annotated sentences , 2008, BMC Bioinformatics.

[14]  David Yarowsky,et al.  One Sense Per Discourse , 1992, HLT.

[15]  Alfonso Valencia,et al.  Overview of BioCreAtIvE: critical assessment of information extraction for biology , 2005, BMC Bioinformatics.

[16]  Dietrich Rebholz-Schuhmann,et al.  BIOINFORMATICS ORIGINAL PAPER Data and text mining Resolving abbreviations to their senses in Medline , 2005 .

[17]  Hongfang Liu,et al.  Research Paper: A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation , 2004, J. Am. Medical Informatics Assoc..

[18]  Alexander S. Yeh,et al.  More accurate tests for the statistical significance of result differences , 2000, COLING.

[19]  Hongfang Liu,et al.  Disambiguating Ambiguous Biomedical Terms in Biomedical Narrative Text: An Unsupervised Method , 2001, J. Biomed. Informatics.

[20]  Tse-Yi Wang,et al.  Reanalyze unassigned reads in Sanger based metagenomic data using conserved gene adjacency , 2010, BMC Bioinformatics.

[21]  Antonio Jimeno-Yepes,et al.  Knowledge-based biomedical word sense disambiguation: comparison of approaches , 2010, BMC Bioinformatics.

[22]  Carol Friedman,et al.  Generating quality word sense disambiguation test sets based on MeSH indexing , 2009, AMIA.

[23]  Mark Stevenson,et al.  Disambiguation of Biomedical Abbreviations , 2009, BioNLP@HLT-NAACL.

[24]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..