Identifying Multi-word Expressions by Leveraging Morphological and Syntactic Idiosyncrasy

Multi-word expressions constitute a significant portion of the lexicon of every natural language, and handling them correctly is essential for many NLP applications. Yet such entities are notoriously hard to define, and are consequently missing from standard lexicons and dictionaries. Multi-word expressions exhibit idiosyncratic behavior on several levels: orthographic, morphological, syntactic and semantic. In this work we take advantage of the morphological and syntactic idiosyncrasy of Hebrew noun compounds and use it to extract such expressions from text corpora. We show that relying on linguistic information dramatically improves the accuracy of compound extraction, eliminating over one third of the errors made by the best baseline.
