Automatic Idiom Recognition with Word Embeddings

Expressions such as "add fuel to the fire" can be interpreted literally or idiomatically, depending on the context in which they occur. Many Natural Language Processing applications would benefit from more accurate idiom recognition. Our work is based on the idea that idioms and their literal counterparts do not appear in the same contexts. We propose two approaches: (1) Compute the inner product of the context word vectors with the vector representing the target expression. Because literal vectors predict their local contexts well, their inner products with context vectors should be larger than those of idiomatic usages, which separates literals from idioms. (2) Compute literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Because the scatter matrices represent context distributions, the difference between the distributions can be measured using the Frobenius norm. For comparison, we implement the methods of [8, 16, 24] and apply them to our data. We provide experimental results validating the proposed techniques.
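The two quantities named above can be illustrated with a minimal NumPy sketch. It is not the implementation evaluated in the paper: the function names, the averaging of inner products over the context, the normalization of the scatter matrix by the number of context words, and the nearest-scatter decision rule are all assumptions made for illustration, and the word vectors are expected to come from any embedding model.

```python
import numpy as np


def mean_inner_product(target_vec, context_vecs):
    """Average inner product of a target vector with each context word vector.

    Assumption: scores are averaged over the context; the abstract only
    states that inner products with the context are computed.
    """
    context = np.asarray(context_vecs, dtype=float)
    target = np.asarray(target_vec, dtype=float)
    return float(np.mean(context @ target))


def scatter_matrix(context_vecs):
    """Scatter (covariance) matrix of context word vectors, one word per row.

    Assumption: the matrix is normalized by the number of context words.
    """
    X = np.asarray(context_vecs, dtype=float)
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]


def frobenius_distance(A, B):
    """Frobenius norm of the difference between two matrices."""
    return float(np.linalg.norm(np.asarray(A) - np.asarray(B), ord="fro"))


def classify_by_scatter(context_vecs, literal_scatter, idiom_scatter):
    """Hypothetical decision rule: pick the class whose scatter matrix is closer."""
    S = scatter_matrix(context_vecs)
    d_literal = frobenius_distance(S, literal_scatter)
    d_idiom = frobenius_distance(S, idiom_scatter)
    return "literal" if d_literal <= d_idiom else "idiomatic"
```

Under these assumptions, a higher `mean_inner_product` suggests literal usage for approach (1), while `classify_by_scatter` compares a test context against literal and idiomatic scatter matrices estimated from labeled contexts for approach (2).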

[1] Jing Peng, et al. Automatic Detection of Idiomatic Clauses, 2013, CICLing.

[2] Cristina Cacciari, et al. The place of idioms in a literal and metaphorical world, 1993.

[3] I. R. McCaig, et al. Oxford Dictionary of Current Idiomatic English. Vol. 1: Verbs with Prepositions and Particles; Oxford Dictionary of Current English. Vol. 2: Phrase, Clause, and Sentence Idioms, 1985.

[4] Marti A. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora, 1992, COLING.

[5] Timothy Baldwin, et al. Multiword Expressions: A Pain in the Neck for NLP, 2002, CICLing.

[6] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[7] Aline Villavicencio, et al. Lexical Encoding of MWEs, 2004.

[8] Caroline Sporleder, et al. Unsupervised Recognition of Literal and Non-Literal Use of Idiomatic Expressions, 2009, EACL.

[9] Afsaneh Fazly, et al. Pulling their Weight: Exploiting Syntactic Forms for the Automatic Identification of Idiomatic Expressions in Context, 2007.

[10] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[11] Afsaneh Fazly, et al. Unsupervised Type and Token Identification of Idiomatic Expressions, 2009, CL.

[12] I. Sag, et al. Idioms, 2015.

[13] Pavel Pudil, et al. Introduction to Statistical Pattern Recognition, 2006.

[14] Paul M. B. Vitányi, et al. The Google Similarity Distance, 2004, IEEE Transactions on Knowledge and Data Engineering.

[15] Dominic Widdows, et al. Automatic Extraction of Idioms using Graph Analysis and Asymmetric Lexicosyntactic Patterns, 2005, ACL.

[16] Eugenie Giesbrecht, et al. Automatic Identification of Non-Compositional Multi-Word Expressions using Latent Semantic Analysis, 2006.

[17] Suzanne Stevenson, et al. The VNC-Tokens Dataset, 2008.

[18] J. R. Firth, et al. A Synopsis of Linguistic Theory, 1930-1955, 1957.

[19] Anoop Sarkar, et al. A Clustering Approach for Nearly Unsupervised Recognition of Nonliteral Language, 2006, EACL.

[20] Ekaterina Vylomova, et al. Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions, 2014, EMNLP.

[21] I. R. McCaig, et al. Oxford Dictionary of Current Idiomatic English, 1994.

[22] Mirella Lapata, et al. Dependency-Based Construction of Semantic Space Models, 2007, CL.

[23] Caroline Sporleder, et al. Using Gaussian Mixture Models to Detect Figurative Language in Context, 2010, NAACL.

[24] Christiane Fellbaum, et al. Corpus-based Studies of German Idioms and Light Verbs, 2006.