Casting a Wide Net: Robust Extraction of Potentially Idiomatic Expressions

Idiomatic expressions like 'out of the woods' and 'up the ante' present a range of difficulties for natural language processing applications. We present work on the annotation and extraction of what we term potentially idiomatic expressions (PIEs), a subclass of multiword expressions covering both literal and non-literal uses of idiomatic expressions. Existing corpora of PIEs are small and have limited coverage of different PIE types, which hampers research. To advance research on the extraction and disambiguation of PIEs, larger corpora are required. In addition, larger corpora are a potential source of valuable linguistic insights into idiomatic expressions and their variability. We propose automatic tools to facilitate the building of larger PIE corpora, and investigate the feasibility of using dictionary-based extraction of PIEs as a pre-extraction step for English. We do this by assessing the reliability and coverage of idiom dictionaries, annotating a PIE corpus, and automatically extracting PIEs from a large corpus. Results show that combinations of dictionaries are a reliable source of idiomatic expressions, that PIEs can be annotated with high reliability (Fleiss' kappa of 0.74-0.91), and that parse-based PIE extraction is highly accurate (88% F1-score). Combining complementary PIE extraction methods raises the F1-score further, to over 92%. Moreover, the extraction method presented here could be extended to other types of multiword expressions and to other languages, provided that sufficient NLP tools are available.
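For readers unfamiliar with the agreement measure reported above, the following is a minimal sketch of how Fleiss' kappa is computed from a table of annotator counts. It follows the standard textbook definition and is not the authors' annotation tooling; the function name `fleiss_kappa` and the toy counts are assumptions made for illustration only.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (>= 2).
    """
    if not counts:
        raise ValueError("at least one item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 candidate phrases, 3 annotators,
# categories = (is a PIE, is not a PIE).
toy_counts = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(round(fleiss_kappa(toy_counts), 3))
```

The measure corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percentage agreement when one category dominates.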
