The LIA Treebank of Spoken Norwegian Dialects

This article presents the LIA treebank of transcribed spoken Norwegian dialects. It consists of dialect recordings made between 1950 and 1990, which have been digitised, transcribed, and subsequently annotated with morphological and dependency-style syntactic analysis as part of the LIA (Language Infrastructure made Accessible) project at the University of Oslo. We describe the LIA dialect recordings and their transcription, transliteration and further morphosyntactic annotation. We focus in particular on the extension of the native Norwegian Dependency Treebank (NDT) annotation scheme to spoken language phenomena, such as pauses and various types of disfluencies, and present the subsequent conversion of the treebank to the Universal Dependencies scheme. The treebank currently consists of 13,608 tokens, distributed over 1,396 segments taken from three different dialects of spoken Norwegian. The LIA treebank annotation is an ongoing effort, and future releases will extend the current data set.
