Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format

This paper presents PDBUD, the largest Polish Dependency Bank in Universal Dependencies format, with 22K trees and 352K tokens. PDBUD builds on its predecessor, the Polish UD treebank (PL-SZ), and contains all 8K PL-SZ trees, which were checked and, where necessary, corrected for the current edition. A further 14K trees were automatically converted from a new version of the Polish Dependency Bank. The PDBUD trees are extended with enhanced edges that encode the shared dependents and shared governors of coordinated conjuncts, and with semantic roles for some dependents. Evaluation experiments show that PDBUD is large enough to train a high-quality graph-based dependency parser for Polish.
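To illustrate what enhanced edges for coordination look like, the following is a minimal sketch and not the conversion procedure used for PDBUD. It assumes a basic UD tree in which non-first conjuncts attach to the first conjunct with `conj`, and it assumes the set of shared dependents is given by the annotation (it cannot be read off the basic tree alone). The `Token` type, the `enhance_coordination` function, and the example sentence are hypothetical; the `root` relation is deliberately not propagated, nested coordination is not handled, and semantic role labels are not covered.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int        # 1-based token index
    form: str      # word form
    head: int      # index of the governor in the basic tree (0 = root)
    deprel: str    # basic dependency relation


def enhance_coordination(tokens, shared_dependents):
    """Return basic edges plus enhanced edges for coordination.

    Edges are (governor_id, relation, dependent_id) triples.
    `shared_dependents` is a set of token ids assumed to be annotated
    as shared by all conjuncts of their governor.
    """
    by_id = {t.id: t for t in tokens}
    # Every basic edge is kept in the enhanced graph.
    edges = {(t.head, t.deprel, t.id) for t in tokens}

    for conj in tokens:
        if conj.deprel != "conj":
            continue
        first = by_id[conj.head]  # the first conjunct

        # Shared governor: the conjunct inherits the incoming relation
        # of the first conjunct (skipped for root in this sketch).
        if first.head != 0:
            edges.add((first.head, first.deprel, conj.id))

        # Shared dependents: each shared dependent of the first conjunct
        # is also attached to this conjunct with the same relation.
        for dep in tokens:
            if dep.head == first.id and dep.id in shared_dependents:
                edges.add((conj.id, dep.deprel, dep.id))

    return edges


# "Chcę kupić i przeczytać książkę" ("I want to buy and read a book")
tokens = [
    Token(1, "Chcę", 0, "root"),
    Token(2, "kupić", 1, "xcomp"),
    Token(3, "i", 4, "cc"),
    Token(4, "przeczytać", 2, "conj"),
    Token(5, "książkę", 2, "obj"),
]
enhanced = enhance_coordination(tokens, shared_dependents={5})
for edge in sorted(enhanced):
    print(edge)
```

In this example, the enhanced graph adds an `xcomp` edge from "Chcę" to "przeczytać" (shared governor) and an `obj` edge from "przeczytać" to "książkę" (shared dependent) on top of the five basic edges.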
