Handling Sparsity for Verb Noun MWE Token Classification

We address the problem of classifying multiword expression (MWE) tokens in running text. We focus on Verb-Noun Constructions (VNCs), whose idiomaticity varies with context. Each VNC token is classified as either idiomatic or literal. Our approach rests on the assumption that a literal VNC has more in common with its component words than an idiomatic one does, where commonality is measured by contextual overlap. To this end, we explore different contextual variations and different similarity measures, handling the sparsity of the possible contexts via four parameter variations. Our approach yields state-of-the-art performance, with an overall accuracy of 75.54% on a TEST data set.
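The core assumption can be illustrated with a minimal sketch. It is not the method evaluated here: the bag-of-words context vectors, the cosine measure, the window size, the threshold of 0.2, and all function names are assumptions chosen for illustration. The sketch compares the context of one VNC token with the pooled contexts of its verb and noun, and labels the token literal when the overlap is high.

```python
from collections import Counter
import math


def context_vector(sentences, target, window=5):
    """Count words that co-occur with `target` within +/- `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_vnc(token_context, verb_vec, noun_vec, threshold=0.2):
    """Label a VNC token by its overlap with its component words' contexts.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    component_vec = verb_vec + noun_vec  # pooled contexts of verb and noun
    sim = cosine(Counter(token_context), component_vec)
    return ("literal" if sim >= threshold else "idiomatic"), sim
```

With sparse counts like these, many token contexts share no words at all with the component vectors and score zero, which is the sparsity problem that the parameter variations are meant to address.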
