Syntactic Identification of Occurrences of Multiword Expressions in Text using a Lexicon with Dependency Structures

We address the syntactic identification of occurrences of multiword expressions (MWEs) from an existing dictionary in a text corpus. The MWEs we identify can be of arbitrary length and can be interrupted in the surface sentence. We analyse and compare three approaches based on linguistic analysis at varying levels, ranging from surface word order to deep syntax. The evaluation is conducted on two corpora: the Prague Dependency Treebank and the Czech National Corpus. We use SemLex, a dictionary of multiword expressions that was compiled by annotating the Prague Dependency Treebank and includes deep syntactic dependency trees of all MWEs.
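The contrast between surface word order and dependency structure can be illustrated with a minimal sketch. It is not the authors' implementation. The token representation (a lemma plus the index of its head), the function names `match_surface` and `match_tree`, the `max_gap` parameter, and the English example are all assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical toy representation: (lemma, index of head token; -1 for the root).
Token = Tuple[str, int]


def match_surface(lemmas: List[str], mwe: List[str], max_gap: int = 3) -> Optional[List[int]]:
    """Find the MWE lemmas in surface order, allowing at most `max_gap`
    intervening tokens between consecutive components."""

    def extend(pos: int, k: int) -> Optional[List[int]]:
        # pos: index of the previously matched token; k: next MWE component.
        if k == len(mwe):
            return []
        for i in range(pos + 1, min(len(lemmas), pos + 2 + max_gap)):
            if lemmas[i] == mwe[k]:
                rest = extend(i, k + 1)
                if rest is not None:
                    return [i] + rest
        return None

    for start, lemma in enumerate(lemmas):
        if mwe and lemma == mwe[0]:
            rest = extend(start, 1)
            if rest is not None:
                return [start] + rest
    return None


def match_tree(sentence: List[Token], mwe_tree: List[Token]) -> Optional[List[int]]:
    """Find the MWE as a connected subtree of the sentence's dependency tree.
    Assumes `mwe_tree` lists every parent before its children."""

    def assign(k: int, mapping: List[int]) -> Optional[List[int]]:
        if k == len(mwe_tree):
            return mapping
        lemma, parent = mwe_tree[k]
        for i, (tok_lemma, head) in enumerate(sentence):
            if tok_lemma != lemma or i in mapping:
                continue
            # The root of the MWE may attach anywhere; other nodes must
            # hang directly under the token matched to their MWE parent.
            if parent == -1 or head == mapping[parent]:
                result = assign(k + 1, mapping + [i])
                if result is not None:
                    return result
        return None

    return assign(0, [])


# Illustrative sentence: "he kick the old rusty wooden big bucket" (lemmatised).
sentence: List[Token] = [
    ("he", 1), ("kick", -1), ("the", 7), ("old", 7),
    ("rusty", 7), ("wooden", 7), ("big", 7), ("bucket", 1),
]
lemmas = [lemma for lemma, _ in sentence]

surface_hit = match_surface(lemmas, ["kick", "bucket"], max_gap=3)
tree_hit = match_tree(sentence, [("kick", -1), ("bucket", 0)])
print(surface_hit, tree_hit)
```

In this toy case the surface matcher misses the expression because five tokens separate its components, while the tree matcher finds it because "bucket" depends directly on "kick" regardless of the distance between them.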
