Paper information - Corpus-Based Diacritic Restoration for South Slavic Languages

Corpus-Based Diacritic Restoration for South Slavic Languages

In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing. Such text is typically easy for humans to understand but very difficult to process computationally because many words become ambiguous or unknown. Letter-level approaches to diacritic restoration generalise better and do not require a lot of training data, but word-level approaches tend to yield better results. However, word-level approaches typically rely on a lexicon, which is an expensive resource that does not cover non-standard forms and is often not available for less-resourced languages. In this paper we present diacritic restoration models that are trained on easy-to-acquire corpora. We test three different types of corpora (Wikipedia, general web, Twitter) for three South Slavic languages (Croatian, Serbian and Slovene) and evaluate them on two types of text: standard (Wikipedia) and non-standard (Twitter). The proposed approach considerably outperforms charlifter, so far the only open-source tool available for this task. We make the best-performing systems freely available.
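The abstract does not spell out how the models work, so the following is only a minimal sketch of one simple word-level, corpus-based baseline: map each diacritic-free form to the diacritised form seen most often in a training corpus. The function names (`strip_diacritics`, `train`, `restore`) and the toy corpus are hypothetical and are not taken from the paper or its released tools.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: "đ"/"Đ" have no Unicode decomposition, so they are mapped by hand.
EXTRA = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(text):
    """Remove combining marks (č -> c, š -> s, ž -> z, ć -> c) and map đ -> d."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.translate(EXTRA)


def train(corpus_tokens):
    """Build a lookup from stripped form to its most frequent diacritised form."""
    counts = defaultdict(Counter)
    for token in corpus_tokens:
        counts[strip_diacritics(token)][token] += 1
    return {
        stripped: forms.most_common(1)[0][0]
        for stripped, forms in counts.items()
    }


def restore(tokens, lexicon):
    """Replace each token with its learned form; unknown tokens pass through."""
    return [lexicon.get(token, token) for token in tokens]


# Toy training corpus (illustrative only).
corpus = ["čaša", "čaša", "casa", "što", "je"]
lexicon = train(corpus)
restored = restore(["sto", "je", "casa", "nepoznato"], lexicon)
```

A lookup like this ignores context, so it cannot tell apart words that are genuinely ambiguous once diacritics are removed; resolving those cases would need something context-sensitive, for example a language model over candidate forms.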

Tomaž Erjavec | Darja Fišer | Nikola Ljubešić

[1] Nikola Ljubešić, et al. Discriminating Between Closely Related Languages on Twitter, 2015, Informatica.

[2] Tomaž Erjavec, et al. Predicting the Level of Text Standardness in User-generated Content, 2015, RANLP.

[3] David Yarowsky, et al. A Comparison of Corpus-Based Techniques for Restoring Accents in Spanish and French Text, 1999.

[4] Rada Mihalcea, et al. Letter Level Learning for Language Independent Diacritics Restoration, 2002, CoNLL.

[5] Dan Tufis, et al. DIAC+: a Professional Diacritics Recovering System, 2008, LREC.

[6] Kenneth Heafield, et al. KenLM: Faster and Smaller Language Model Queries, 2011, WMT@EMNLP.

[7] Borbála Siklósi, et al. Automatic Diacritics Restoration for Hungarian, 2015, EMNLP.

[8] Nikola Ljubešić, et al. {bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian, 2014, WaC@EACL.

[9] J. Šnajder, et al. Automatic Diacritics Restoration in Croatian Texts, 2009.

[10] Tomaž Erjavec, et al. The slWaC 2.0 Corpus of the Slovene Web, 2014.

[11] Tomaž Erjavec, et al. TweetCaT: a tool for building Twitter corpora of smaller languages, 2014, LREC.

[13] Michel Simard. Automatic Insertion of Accents in French Text, 1998, EMNLP.

[14] David Yarowsky, et al. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French, 1994, ACL.