Annotation and Extraction of Multiword Expressions in Turkish Treebanks

Multiword expressions (MWEs) have distinctive semantic properties, so their automatic extraction receives special attention from the natural language processing (NLP) and corpus linguistics communities and remains an active research area. Unfortunately, creating the resources this task requires is demanding, and many languages, Turkish among them, lack such resources. This study presents our MWE annotations on recently introduced Turkish treebanks. The annotations cover various types of linguistic units and expressions, including named entities, numerical expressions, idiomatic phrases, verb phrases with auxiliaries, and duplications. The paper aims to provide a benchmark and to pave the way for further research on MWE extraction for Turkish. To this end, it also reports experimental results on the annotated corpora for seven baseline approaches, a dependency parser, and a previously introduced rule-based extractor. The best results achieved on these resources are F-scores of about 60%.
