Word sense disambiguation across two domains: Biomedical literature and clinical notes

The aim of this study is to explore the word sense disambiguation (WSD) problem across two biomedical domains-biomedical literature and clinical notes. A supervised machine learning technique was used for the WSD task. One of the challenges addressed is the creation of a suitable clinical corpus with manual sense annotations. This corpus in conjunction with the WSD set from the National Library of Medicine provided the basis for the evaluation of our method across multiple domains and for the comparison of our results to published ones. Noteworthy is that only 20% of the most relevant ambiguous terms within a domain overlap between the two domains, having more senses associated with them in the clinical space than in the biomedical literature space. Experimentation with 28 different feature sets rendered a system achieving an average F-score of 0.82 on the clinical data and 0.86 on the biomedical literature.

[1]  Olivier Bodenreider,et al.  The NLM Indexing Initiative , 2000, AMIA.

[2]  Halil Kilicoglu,et al.  Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: Preliminary experiment , 2006, J. Assoc. Inf. Sci. Technol..

[3]  Jean Carletta,et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic , 1996, CL.

[4]  Tong Zhang,et al.  Solving large scale linear prediction problems using stochastic gradient descent algorithms , 2004, ICML.

[5]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[6]  L. Ohno-Machado Journal of Biomedical Informatics , 2001 .

[7]  Thomas C. Rindflesch,et al.  Effects of information and machine learning algorithms on word sense disambiguation with small datasets , 2005, Int. J. Medical Informatics.

[8]  S.J.J. Smith,et al.  Empirical Methods for Artificial Intelligence , 1995 .

[9]  Boban Velickovic,et al.  Nonstandard models and analytic equivalence relations , 1997 .

[10]  Rie Kubota Ando,et al.  Applying Alternating Structure Optimization to Word Sense Disambiguation , 2006, CoNLL.

[11]  Hinrich Schütze,et al.  Automatic Word Sense Discrimination , 1998, Comput. Linguistics.

[12]  Martijn J. Schuemie,et al.  Word Sense Disambiguation in the Biomedical Domain: An Overview , 2005, J. Comput. Biol..

[13]  Hongfang Liu,et al.  Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues , 2006, BMC Bioinformatics.

[14]  G. Schuler,et al.  Entrez: molecular biology database and retrieval system. , 1996, Methods in enzymology.

[15]  E. S. Pearson,et al.  THE USE OF CONFIDENCE OR FIDUCIAL LIMITS ILLUSTRATED IN THE CASE OF THE BINOMIAL , 1934 .

[16]  Eneko Agirre,et al.  Word Sense Disambiguation: Algorithms and Applications , 2007 .

[17]  Hagit Shatkay,et al.  Mining the Biomedical Literature in the Genomic Era: An Overview , 2003, J. Comput. Biol..

[18]  Renata Vieira,et al.  A Corpus-based Investigation of Definite Description Use , 1997, CL.

[19]  Rada Mihalcea,et al.  Exploiting Agreement and Disagreement of Human Annotators for Word Sense Disambiguation , 2003 .

[20]  Eneko Agirre,et al.  Word Sense Disambiguation: Algorithms and Applications (Text, Speech and Language Technology) , 2006 .

[21]  Ted Pedersen,et al.  Abbreviation and Acronym Disambiguation in Clinical Discourse , 2005, AMIA.

[22]  Hwee Tou Ng,et al.  An Empirical Evaluation of Knowledge Sources and Learning Algorithms for Word Sense Disambiguation , 2002, EMNLP.

[23]  Marc Weeber,et al.  Developing a test collection for biomedical word sense disambiguation , 2001, AMIA.

[24]  Hongfang Liu,et al.  Research Paper: A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation , 2004, J. Am. Medical Informatics Assoc..

[25]  Betsy L. Humphreys,et al.  Technical Milestone: The Unified Medical Language System: An Informatics Research Collaboration , 1998, J. Am. Medical Informatics Assoc..

[26]  Ian Witten,et al.  Data Mining , 2000 .

[27]  Christopher G. Chute,et al.  Domain-specific language models and lexicons for tagging , 2005, J. Biomed. Informatics.

[28]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[29]  Padmini Srinivasan,et al.  Gene Terms and English Words: An Ambiguous Mix , .

[30]  Thomas C. Rindflesch,et al.  Using Symbolic Knowledge in the UMLS to Disambiguate Words in Small Datasets with a Naïve Bayes Classifier , 2004, MedInfo.

[31]  Carol Friedman,et al.  Towards a comprehensive medical language processing system: methods and issues , 1997, AMIA.

[32]  A. Valencia,et al.  Text-mining and information-retrieval services for molecular biology , 2005, Genome Biology.