Universal Dependencies for Finnish

There has been substantial recent interest in annotation schemes that can be applied consistently to many languages. Building on several recent efforts to unify morphological and syntactic annotation, the Universal Dependencies (UD) project seeks to introduce a cross-linguistically applicable part-of-speech tagset, feature inventory, and set of dependency relations as well as a large number of uniformly annotated treebanks. We present Universal Dependencies for Finnish, one of the ten languages in the recent first release of UD project treebank data. We detail the mapping of previously introduced annotation to the UD standard, describing specific challenges and their resolution. We additionally present parsing experiments comparing the performance of a state-of-the-art parser trained on a language-specific annotation schema to performance on the corresponding UD annotation. The results show improvement compared to the source annotation, indicating that the conversion is accurate and supporting the feasibility of UD as a parsing target. The introduced tools and resources are available under open licenses from http://bionlp.utu.fi/ud-finnish.html.
