Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches

With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is a token-level pipeline of modules that depends heavily on external linguistic resources and manually defined rules. Instead, we propose a fully automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.

A. Cüneyd Tantuğ | Umut Sulubacak | Talha Çolakoğlu
