Using a Maximum Entropy Classifier to link “good” corpus examples to dictionary senses

A particular problem in maintaining dictionaries is replacing outdated example sentences with up-to-date corpus examples. Extraction methods such as the good example finder (GDEX; Kilgarriff, 2008) have been developed to tackle this problem. We extend GDEX to polysemous entries by applying machine learning techniques to map example sentences to the appropriate dictionary senses. The idea is to enrich our knowledge base by computing the set of all collocations and to use a maximum entropy classifier (MEC; Nigam, 1999) to learn the mapping between a corpus sentence and its correct dictionary sense. Our method is based on hand-labeled sense annotations. Results show an accuracy of 49.16% for the MEC, which is significantly better than the Lesk algorithm (31.17%).
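To make the sentence-to-sense mapping concrete, the following is a minimal sketch of a maximum entropy (multinomial logistic regression) classifier over bag-of-words features. It is an illustration under stated assumptions, not the authors' implementation: the English toy sentences, the sense labels "finance" and "river", the plain token features (the abstract mentions collocations), and the use of batch gradient ascent with a small L2 penalty are all hypothetical choices made here for brevity.

```python
import math
from collections import defaultdict

# Hypothetical hand-labeled examples for the polysemous word "bank".
TRAINING = [
    ("she deposited money at the bank", "finance"),
    ("the bank raised its interest rates", "finance"),
    ("we sat on the bank of the river", "river"),
    ("the river bank was muddy after rain", "river"),
]

def features(sentence):
    """Bag-of-words features: the set of lowercased tokens (an assumption)."""
    return set(sentence.lower().split())

def predict_proba(weights, senses, feats):
    """Softmax over per-sense scores; a score sums the weights of active features."""
    scores = {s: sum(weights[(s, f)] for f in feats) for s in senses}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}

def train(examples, epochs=200, lr=0.1, l2=0.01):
    """Fit the weights by batch gradient ascent on the penalized log-likelihood."""
    senses = sorted({sense for _, sense in examples})
    weights = defaultdict(float)  # unseen (sense, feature) pairs weigh 0.0
    data = [(features(text), sense) for text, sense in examples]
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in data:
            probs = predict_proba(weights, senses, feats)
            for s in senses:
                # Observed minus expected count of each active feature.
                err = (1.0 if s == gold else 0.0) - probs[s]
                for f in feats:
                    grad[(s, f)] += err
        for key, g in grad.items():
            weights[key] += lr * (g - l2 * weights[key])
    return weights, senses

def classify(weights, senses, sentence):
    """Return the most probable sense for a corpus sentence."""
    probs = predict_proba(weights, senses, features(sentence))
    return max(probs, key=probs.get)

WEIGHTS, SENSES = train(TRAINING)
print(classify(WEIGHTS, SENSES, "he deposited money in his account"))
print(classify(WEIGHTS, SENSES, "the muddy river"))
```

Tokens never seen in training contribute nothing to any sense's score, so a sentence is assigned on the strength of the context words it shares with the labeled examples.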

[1] Hwee Tou Ng, et al. An Empirical Evaluation of Knowledge Sources and Learning Algorithms for Word Sense Disambiguation, 2002, EMNLP.

[2] Andrew McCallum, et al. Using Maximum Entropy for Text Classification, 1999.

[3] Lothar Lemnitzer, et al. Automatic example sentence extraction for a contemporary German dictionary, 2012.

[4] Paola Velardi, et al. Structural semantic interconnections: a knowledge-based approach to word sense disambiguation, 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] J. Nocedal, et al. A Limited Memory Algorithm for Bound Constrained Optimization, 1995, SIAM J. Sci. Comput.

[6] Adam Kilgarriff, et al. GDEX: Automatically Finding Good Dictionary Examples in a Corpus, 2008.

[7] Pavel Rychlý, et al. A Lexicographer-Friendly Association Score, 2008, RASLAN.

[8] Michael E. Lesk, et al. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, 1986, SIGDOC '86.

[9] Jörg Didakowski. Local syntactic tagging of large corpora using weighted finite state transducers, 2008, KONVENS.

[10] Alexander Geyken, et al. From DWDS Corpora to a German Word Profile – Methodological Problems and Solutions, 2013.

[11] Adam Kilgarriff, et al. The Sketch Engine, 2004.

[12] Claudia Kunze, et al. GermaNet - representation, visualization, application, 2002, LREC.

[13] Roberto Navigli, et al. Word sense disambiguation: A survey, 2009, CSUR.

[14] Raymond J. Mooney, et al. Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning, 1996, EMNLP.

[15] Philippe Langlais, et al. Evaluating Variants of the Lesk Approach for Disambiguating Words, 2004, LREC.

[16] Lluís Màrquez i Villodre, et al. Boosting Applied to Word Sense Disambiguation, 2000, ArXiv.

[17] Richard Johansson, et al. Semi-automatic selection of best corpus examples for Swedish: Initial algorithm evaluation, 2012.

[18] Timothy Baldwin, et al. Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models, 2014, ACL.

[19] Alexander Geyken, et al. Methoden bei der Wörterbuchplanung in Zeiten der Internetlexikographie [Methods in dictionary planning in the era of Internet lexicography / Méthodes en planification dictionnairique à l’époque de la lexicographie en ligne], 2014.

[20] Ted Pedersen, et al. Learning Probabilistic Models of Word Sense Disambiguation, 2007, ArXiv.

[21] Timothy Baldwin, et al. Applying a Word-sense Induction System to the Automatic Extraction of Diverse Dictionary Examples, 2014.

[22] Iztok Kosem, et al. GDEX for Slovene, 2011.