Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues

BackgroundWord sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems because ambiguous words negatively impact accurate access to literature containing biomolecular entities, such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem is still a challenging one. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the results typically incorporate a number of confounding factors, and it is problematic to truly understand the effectiveness and generalizability of the methods because these factors interact with each other and affect the final results. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance.ResultsExperiments were designed to measure the effect of "sample size" (i.e. size of the datasets), "sense distribution" (i.e. the distribution of the different meanings of the ambiguous word) and "degree of difficulty" (i.e. the measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. Support Vector Machine (SVM) classifiers were applied to an automatically generated data set containing four ambiguous biomedical abbreviations: BPD, BSA, PCA, and RSV, which were chosen because of varying degrees of differences in their respective senses. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e. cases where the distances between the senses were large); in difficult cases an unusually large increase in sample size was needed to increase performance slightly, which was impractical, 2) the sense distribution did not have an effect on performance when the senses were separable, 3) when there was a majority sense of over 90%, the WSD classifier was not better than use of the simple majority sense, 4) error rates were proportional to the similarity of senses, and 5) there was no statistical difference between results when using a 5-fold or 10-fold cross-validation method. Other issues that impact performance are also enumerated.ConclusionSeveral different independent aspects affect performance when using ML techniques for WSD. We found that combining them into one single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we utilized a well-established statistical method that guarantees the results are likely to be generalizable for abbreviations with similar characteristics. The results of our experiments show that in order to understand the performance of these ML methods it is critical that papers report on the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. In addition, papers should also characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, as well as the ML methods and features used. This should lead to an improved understanding of the generalizablility and the limitations of the methodology.

[1]  KilicogluHalil,et al.  Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing , 2006 .

[2]  David J. Hand,et al.  Construction and Assessment of Classification Rules , 1997 .

[3]  Thomas C. Rindflesch,et al.  Effects of information and machine learning algorithms on word sense disambiguation with small datasets , 2005, Int. J. Medical Informatics.

[4]  Hwee Tou Ng,et al.  An Empirical Evaluation of Knowledge Sources and Learning Algorithms for Word Sense Disambiguation , 2002, EMNLP.

[5]  Hongfang Liu,et al.  Research Paper: A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation , 2004, J. Am. Medical Informatics Assoc..

[6]  A. Valencia,et al.  Text-mining and information-retrieval services for molecular biology , 2005, Genome Biology.

[7]  George Hripcsak,et al.  Analysis of Variance of Cross-Validation Estimators of the Generalization Error , 2005, J. Mach. Learn. Res..

[8]  Hwee Tou Ng,et al.  Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach , 1996, ACL.

[9]  Ted Pedersen,et al.  Distinguishing Word Senses in Untagged Text , 1997, EMNLP.

[10]  Chih-Jen Lin,et al.  A comparison of methods for multiclass support vector machines , 2002, IEEE Trans. Neural Networks.

[11]  Ryan M. Rifkin,et al.  In Defense of One-Vs-All Classification , 2004, J. Mach. Learn. Res..

[12]  Steven Salzberg,et al.  On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach , 1997, Data Mining and Knowledge Discovery.

[13]  Magnus Merkel,et al.  Combination of Contextual Features for Word Sense Disambiguation: LIU-WSD , 2001, SENSEVAL@ACL.

[14]  John G. Cleary,et al.  AZuRE, a scalable system for automated term disambiguation of gene and protein names , 2004 .

[15]  Tapio Salakoski,et al.  New Techniques for Disambiguation in Natural Language and Their Application to Biological Text , 2004, J. Mach. Learn. Res..

[16]  Yorick Wilks,et al.  Providing machine tractable dictionary tools , 1990, Machine Translation.

[17]  Alan R. Aronson,et al.  Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program , 2001, AMIA.

[18]  ResnikPhilip,et al.  Distinguishing systems and distinguishing senses: new evaluation methods for Word Sense Disambiguation , 1999 .

[19]  Ted Pedersen,et al.  Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation , 2004, CoNLL.

[20]  M. Friedman The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance , 1937 .

[21]  Martijn J. Schuemie,et al.  Word Sense Disambiguation in the Biomedical Domain: An Overview , 2005, J. Comput. Biol..

[22]  Keinosuke Fukunaga,et al.  Effects of Sample Size in Classifier Design , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[23]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[24]  Hongfang Liu,et al.  Research Paper: Automatic Resolution of Ambiguous Terms Based on Machine Learning and Conceptual Relations in the UMLS , 2002, J. Am. Medical Informatics Assoc..

[25]  Marc Weeber,et al.  Text-based discovery in biomedicine: the architecture of the DAD-system , 2000, AMIA.

[26]  John G. Cleary,et al.  AZuRE, a scalable system for automated term disambiguation of gene and protein names , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[27]  T. Tatusova,et al.  Entrez Gene: gene-centered information at NCBI , 2006, Nucleic Acids Res..

[28]  Shlomo Argamon,et al.  Minimizing Manual Annotation Cost in Supervised Training from Corpora , 1996, ACL.

[29]  Padmini Srinivasan,et al.  Gene Terms and English Words: An Ambiguous Mix , .

[30]  Raymond J. Mooney,et al.  Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning , 1996, EMNLP.

[31]  Hongfang Liu,et al.  Gene name ambiguity of eukaryotic nomenclatures , 2005, Bioinform..

[32]  W. N. Locke,et al.  Machine Translation of Languages , 1956 .

[33]  Vasileios Hatzivassiloglou,et al.  Disambiguating proteins, genes, and RNA in text: a machine learning approach , 2001, ISMB.

[34]  Janyce Wiebe,et al.  Word-Sense Disambiguation Using Decomposable Models , 1994, ACL.

[35]  Halil Kilicoglu,et al.  Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: Preliminary experiment , 2006, J. Assoc. Inf. Sci. Technol..

[36]  Alan F. Scott,et al.  Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders , 2002, Nucleic Acids Res..

[37]  Martijn J. Schuemie,et al.  Distribution of information in biomedical abstracts and full-text publications , 2004, Bioinform..

[38]  James Pustejovsky,et al.  Automatic Extraction of Acronym-meaning Pairs from MEDLINE Databases , 2001, MedInfo.

[39]  Dietrich Rebholz-Schuhmann,et al.  BIOINFORMATICS ORIGINAL PAPER Data and text mining Resolving abbreviations to their senses in Medline , 2005 .

[40]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[41]  Martijn J. Schuemie,et al.  Thesaurus-based disambiguation of gene symbols , 2005, BMC Bioinformatics.

[42]  David J. Hand Assessing classification rules , 1994 .

[43]  Mark R. Wade,et al.  Construction and Assessment of Classification Rules , 1999, Technometrics.

[44]  David Yarowsky,et al.  Distinguishing systems and distinguishing senses: new evaluation methods for Word Sense Disambiguation , 1999, Natural Language Engineering.

[45]  Hagit Shatkay,et al.  Mining the Biomedical Literature in the Genomic Era: An Overview , 2003, J. Comput. Biol..