Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches

With the growth of the social web, user-generated text data has reached an unprecedented scale. Normalizing non-canonical text makes this data a practical source of training data for language processing systems. The state of the art in Turkish text normalization is a token-level pipeline of modules that depends heavily on external linguistic resources and manually defined rules. Instead, we propose a fully automated, context-aware machine translation approach with fewer processing stages. Experiments with several implementations of our approach show that it surpasses the current best-performing system by a large margin.
