Paper information - Extracting Verbal Multiword Data from Rich Treebank Annotation

The PARSEME Shared Task on automatic identification of verbal multiword expressions aims at identifying such expressions in running text. A typology of verbal multiword expressions, very detailed annotation guidelines, and gold-standard data for as many languages as possible will be provided. Since the Prague Dependency Treebank includes Czech multiword expression annotation, it was natural to attempt an automatic conversion of the data into the Shared Task format. However, since the Czech treebank predates the Shared Task annotation guidelines, a prior examination was necessary to determine to what extent the conversion can be fully automatic and how much manual work remains. In this paper, we show that the information contained in the Prague Dependency Treebank is sufficient to extract all of the Shared Task categories of verbal multiword expressions relevant for Czech, even if these categories were originally annotated differently; nevertheless, some manual checking and annotation would still be necessary, e.g. for distinguishing borderline cases.

1 Motivation

The goal of the PARSEME [11] Shared Task (PST)^1 is to develop automatic detection of verbal multiword expressions (VMWEs) for a wide range of languages from different language families. It includes data preparation for the task participants, based on annotation guidelines that were tested on real data for almost twenty languages [16].^2 The training and testing data for the PST (3,500 instances per language) are being annotated; while manual annotation is necessary for many languages, reusing existing annotated data is preferred whenever possible. This preference led us to explore the Prague Dependency Treebank (PDT, [1, 4]), which includes fairly rich annotation of MWEs.^3 However, the anno

^1 http://multiword.sourceforge.net/sharedtask2017
^2 Also at http://parsemefr.lif.univ-mrs.fr/guidelines-hypertext.
^3 Some VMWE categories were annotated during the creation of the original PDT 2.0, others were annotated specifically for PDT 2.5; PDT 3.0 contains all of them.

[1] Adam Przepiórkowski et al. PARSEME – PARSing and Multiword Expressions within a European multilingual network, 2015.

[2] Jan Hajic et al. Linguistic Annotation: from Links to Cross-Layer Lexicons, 2003.

[3] Adam Przepiórkowski et al. A survey of multiword expressions in treebanks, 2015.

[4] V. Vincze. Annotation guidelines for the PARSEME shared task on automatic detection of verbal Multi Word Expressions, version 5.0, 4 March 2016.

[5] Eduard Bejček et al. MWEs in Treebanks: From Survey to Guidelines, 2016, LREC.

[6] Eduard Bejček et al. Annotation of multiword expressions in the Prague dependency treebank, 2010, IJCNLP.

[7] V. Kolárová. Chapter 2. Special valency behavior of Czech deverbal nouns, 2014.

[8] Jan Hajic et al. An Analysis of Annotation of Verb-Noun Idiomatic Combinations in a Parallel Dependency Corpus, 2013, MWE@NAACL-HLT.

[9] Diplomová Práce et al. Univerzita Karlova v Praze Matematicko-fyzikální fakulta, 2003.

[10] Petr Pajas et al. PDT-VALLEX: Creating a Large-coverage Valency Lexicon for Treebank Annotation, 2003.

[11] Veronika Kolárová. Valency of deverbal nouns in Czech, 2006, Prague Bull. Math. Linguistics.

[12] Zdeňka Urešová. Valence sloves v Pražském závislostním korpusu, 2012.

[13] Adam Przepiórkowski et al. Phraseology in Two Slavic Valency Dictionaries: Limitations and Perspectives, 2016.