Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

In this paper we present the adaptation of a state-of-the-art tagger for South Slavic languages to non-standard text, using Slovene as an example. We investigate the impact of introducing in-domain training data, as well as of additional supervision through external resources and tools such as word clusters and word normalization. Training the tagger on a combination of standard and non-standard data removes more than half of the error that the standard tagger makes on non-standard text, and enriching the data representation with external resources removes an additional 11 percent of the error. The final configuration achieves a tagging accuracy of 87.41% on the full morphosyntactic description, which is nevertheless still well below the 94.27% achieved on standard text.
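The figures above are stated partly as accuracies and partly as shares of error removed. As a minimal sketch, assuming the usual definition of relative error reduction (the abstract does not spell out its formula), the two can be related as follows. The function names are illustrative and do not come from the paper, and only the two accuracies quoted above are used as inputs.

```python
def error_rate(accuracy_pct: float) -> float:
    """Token-level error rate in percent, given accuracy in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Share of the baseline error removed by the improved system (0 to 1).

    Assumed definition: (baseline error - improved error) / baseline error.
    """
    baseline_err = error_rate(baseline_acc)
    return (baseline_err - error_rate(improved_acc)) / baseline_err


# Accuracies reported in the abstract (full morphosyntactic description).
NONSTANDARD_ACC = 87.41
STANDARD_ACC = 94.27

nonstandard_err = error_rate(NONSTANDARD_ACC)  # error left on non-standard text
standard_err = error_rate(STANDARD_ACC)        # error on standard text
gap_ratio = nonstandard_err / standard_err     # how much larger the former is

print(f"non-standard error: {nonstandard_err:.2f}%")
print(f"standard error:     {standard_err:.2f}%")
print(f"ratio:              {gap_ratio:.2f}")
```

Under this reading, the remaining error on non-standard text is a little over twice the error on standard text, which is the gap the last sentence of the abstract points to.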
