The PROIEL treebank family: a standard for early attestations of Indo-European languages

This article describes a family of dependency treebanks of early attestations of Indo-European languages, originating in the parallel treebank built by the members of the project Pragmatic Resources in Old Indo-European Languages (PROIEL). The treebanks share a set of open-source software tools, including a web annotation interface, and a set of annotation schemes and guidelines developed especially for the project languages. They use an enriched dependency grammar scheme complemented by detailed morphological tags; together these have proved sufficient to give detailed descriptions of these richly inflected languages and have been easy to adapt to new languages. We describe the tools and annotation schemes and discuss some challenges posed by the various languages that have been annotated. We also discuss problems with tokenisation, sentence division and lemmatisation, which are commonly encountered in ancient and mediaeval texts, as well as challenges associated with low levels of standardisation and ongoing morphological and syntactic change.
