Automatic Idiom Identification in Wiktionary

Online resources such as Wiktionary are an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high-quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, simply by using known word sense disambiguation algorithms to match phrases to their definitions. On a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.
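To make the sentence-level step concrete, the following is a minimal sketch of matching a phrase occurrence to one of its dictionary definitions by gloss overlap, in the spirit of Lesk-style word sense disambiguation. It is an illustration under stated assumptions, not the system evaluated here: the stopword list, the helper names `tokens` and `best_definition`, and the two glosses for "kick the bucket" are hypothetical, and the `idiomatic` flag stands in for a label that would come from Wiktionary or from the learned classifier.

```python
import re

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "or",
             "is", "was", "he", "she", "it", "his", "her"}


def tokens(text):
    """Return the set of lowercase word tokens in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS}


def best_definition(sentence, definitions):
    """Pick the definition whose gloss shares the most words with the sentence.

    This is a simplified gloss-overlap heuristic (an assumption made for this
    sketch), not necessarily the disambiguation algorithm used in the paper.
    """
    context = tokens(sentence)
    return max(definitions, key=lambda d: len(context & tokens(d["gloss"])))


# Hypothetical definitions for the phrase "kick the bucket".
definitions = [
    {"gloss": "to strike a pail with the foot", "idiomatic": False},
    {"gloss": "to die, to pass away after illness or old age", "idiomatic": True},
]

idiom_sentence = "After a long illness, the old farmer finally kicked the bucket."
literal_sentence = "He kicked the bucket with his foot and the pail rolled away."

idiom_choice = best_definition(idiom_sentence, definitions)
literal_choice = best_definition(literal_sentence, definitions)

print(idiom_choice["idiomatic"], literal_choice["idiomatic"])
```

Under this scheme, a phrase occurrence is detected as idiomatic whenever the selected definition carries an idiom label, so labeling more definitions as idiomatic directly increases the number of occurrences that can be detected.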
