The IMP historical Slovene language resources

The paper describes the combined results of several projects which together constitute a basic language resource infrastructure for printed historical Slovene. The IMP language resources consist of a digital library, an annotated corpus and a lexicon, which are interlinked and uniformly encoded following the Text Encoding Initiative Guidelines. The library holds about 650 units (mostly complete books) consisting of facsimiles with 45,000 pages as well as hand-corrected and structured transcriptions. The hand-annotated corpus has 300,000 tokens, where each word is tagged with its modernised word form, lemma, part-of-speech and, for archaic words, its nearest contemporary equivalents. This information was extracted into the lexicon, which also covers an extended target-annotated corpus, resulting in 20,000 lemmas (4,000 of them archaic) with 50,000 modern word forms and 70,000 attested forms. We have also developed a program to modernise, tag and lemmatise historical Slovene, and annotated the digital library with it, producing an automatically annotated corpus of 15 million words. To serve the humanities, the digital library and lexicon are available for reading and browsing on the web, and the corpora are accessible through a concordancer. For language technology research and development, the resources are available in source TEI XML under the Creative Commons Attribution licence. The paper presents the IMP resources, available from http://nl.ijs.si/imp/, describes the process of their compilation, encoding and dissemination, and concludes with directions for future research.
