Extracting Verbal Multiword Data from Rich Treebank Annotation

The PARSEME Shared Task on automatic identification of verbal multiword expressions aims at identifying such expressions in running text. A typology of verbal multiword expressions, detailed annotation guidelines, and gold-standard data for as many languages as possible will be provided. Since the Prague Dependency Treebank includes Czech multiword expression annotation, it was natural to attempt an automatic conversion of the data into the Shared Task format. However, since the Czech treebank predates the Shared Task annotation guidelines, a prior examination was necessary to determine to what extent the conversion can be fully automatic and how much manual work remains. In this paper, we show that the information contained in the Prague Dependency Treebank is sufficient to extract all of the Shared Task categories of verbal multiword expressions relevant for Czech, even if these categories were originally annotated differently; nevertheless, some manual checking and annotation would still be necessary, e.g. for distinguishing borderline cases.

1 Motivation

The goal of the PARSEME [11] Shared Task (PST)¹ is to develop automatic detection of verbal multiword expressions (VMWEs) for a wide range of languages from different language families. It includes data preparation for the task participants, based on annotation guidelines that were tested on real data for almost twenty languages [16].² The training and testing data for the PST (3,500 instances per language) are being annotated; while manual annotation is necessary for many languages, reusing existing annotated data is preferred whenever possible. This preference led us to explore the Prague Dependency Treebank (PDT, [1, 4]), which includes a rich annotation of MWEs.³ However, the anno

¹ http://multiword.sourceforge.net/sharedtask2017
² Also at http://parsemefr.lif.univ-mrs.fr/guidelines-hypertext.
³ Some VMWE categories were annotated during the creation of the original PDT 2.0, others were annotated specifically for PDT 2.5; PDT 3.0 contains all of them.