Gold-Standard Datasets for Annotation of Slovene Computer-Mediated Communication

This paper presents the first publicly available, manually annotated gold-standard datasets for the annotation of Slovene Computer-Mediated Communication. In this type of language, diacritics, punctuation and spaces are often omitted, and phonetic spelling and slang words are frequently used, which considerably degrades the performance of text processing tools trained on standard Slovene. Janes-Norm, which contains 7,816 texts (184,766 tokens), is a gold-standard dataset for tokenisation, sentence segmentation and word normalisation, whereas Janes-Tag, comprising 2,958 texts (75,276 tokens), was created for training and evaluating morphosyntactic tagging and lemmatisation tools for non-standard Slovene.
