Construction of Large-scale English Verbal Multiword Expression Annotated Corpus

Multiword expressions (MWEs) are groups of tokens that should be treated as a single syntactic or semantic unit. In this work, we focus on verbal MWEs (VMWEs), which are difficult to recognize accurately because they can be discontinuous (e.g., take ... off). Since previous English VMWE annotations are relatively small in terms of VMWE occurrences and types, we conduct large-scale annotation of VMWEs on the Wall Street Journal portion of English OntoNotes by combining automatic annotation with crowdsourcing. Concretely, we first construct a VMWE dictionary based on the English-language Wiktionary. We then collect possible VMWE occurrences in OntoNotes and filter the candidates with the help of gold dependency trees. Finally, we formalize VMWE annotation as a multiword sense disambiguation problem so that it can be crowdsourced. As a result, we annotate 7,833 VMWE instances belonging to various categories, such as phrasal verbs, light verb constructions, and semi-fixed VMWEs. We hope this large-scale VMWE-annotated resource helps in developing models for MWE recognition and dependency parsing that are aware of English MWEs. Our resource is publicly available.
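
The candidate collection and dependency-based filtering steps can be illustrated with a minimal sketch. It is not the procedure used in this work: the function names, the max_gap limit, and the filter criterion (the matched tokens must form a connected subtree of the dependency tree) are assumptions made for illustration only, and the example sentences and head indices are hand-constructed.

```python
def find_candidates(lemmas, vmwe, max_gap=5):
    """Return index tuples where the lemmas of `vmwe` occur in order.

    Gaps of up to `max_gap` tokens are allowed between consecutive
    components, so discontinuous occurrences are found as well.
    (`max_gap` is a hypothetical parameter, not taken from the paper.)
    """
    results = []

    def extend(prefix, start, k):
        if k == len(vmwe):
            results.append(tuple(prefix))
            return
        for i in range(start, len(lemmas)):
            if prefix and i - prefix[-1] - 1 > max_gap:
                break  # too far from the previous component
            if lemmas[i] == vmwe[k]:
                extend(prefix + [i], i + 1, k + 1)

    extend([], 0, 0)
    return results


def is_connected(indices, heads):
    """Assumed filter: keep a candidate only if its tokens are connected.

    `heads[i]` is the index of token i's head (-1 for the root). In a tree,
    a token set is connected exactly when one token has its head outside
    the set.
    """
    inside = set(indices)
    return sum(1 for i in indices if heads[i] not in inside) == 1


# "He took his hat off": "off" attaches to "took".
lemmas_a = ["he", "take", "his", "hat", "off"]
heads_a = [1, -1, 3, 1, 1]

# "He took the book and ran off": "off" attaches to "ran", not "took".
lemmas_b = ["he", "take", "the", "book", "and", "run", "off"]
heads_b = [1, -1, 3, 1, 5, 1, 5]

for lemmas, heads in [(lemmas_a, heads_a), (lemmas_b, heads_b)]:
    for cand in find_candidates(lemmas, ("take", "off")):
        print(cand, is_connected(cand, heads))
```

In this sketch both sentences yield a surface match for "take ... off", but only the first survives the dependency filter, which is the kind of false positive that gold trees help remove before the remaining candidates are sent to annotators.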
