Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation

In everyday use, the Indonesian language is riddled with informality, that is, deviations from the standard in terms of vocabulary, spelling, and word order. On the other hand, currently available Indonesian NLP models are typically developed with standard Indonesian in mind. In this work, we address style transfer from informal to formal Indonesian as a low-resource machine translation problem. We build a new dataset of parallel sentences of informal Indonesian and its formal counterpart. We benchmark several strategies for style transfer from informal to formal Indonesian, and we also explore augmenting the training set with artificial forward-translated data. Since we are dealing with an extremely low-resource setting, we find that a phrase-based machine translation approach outperforms the Transformer-based approach. Alternatively, a pre-trained GPT-2 fine-tuned on this task performs equally well but requires more computational resources. Our findings mark a promising step towards leveraging machine translation models for style transfer. Our code and data are available at https://github.com/haryoa/stif-indonesia.
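
The iterative forward-translation augmentation can be summarized as a simple loop: a model trained on the small parallel corpus translates additional informal-only text into synthetic formal text, the synthetic pairs are added to the training data, and the process repeats. The Python sketch below is only a minimal illustration of that idea; train_model and translate are hypothetical caller-supplied hooks (e.g. wrappers around a PBMT or Transformer toolkit), not functions from the released code.

from typing import Callable, List, Tuple

def iterative_forward_translation(
    parallel_pairs: List[Tuple[str, str]],   # (informal, formal) sentence pairs
    informal_only: List[str],                # informal sentences without formal references
    train_model: Callable[[List[Tuple[str, str]]], object],
    translate: Callable[[object, str], str],
    rounds: int = 2,
):
    """Augment a small informal-to-formal parallel set with forward-translated data.

    train_model and translate are placeholders supplied by the caller; they stand
    in for whatever training and decoding pipeline is used in practice.
    """
    train_data = list(parallel_pairs)
    model = None
    for _ in range(rounds):
        # Train on the current (possibly augmented) data.
        model = train_model(train_data)
        # Forward-translate the monolingual informal sentences into synthetic formal ones.
        synthetic = [(src, translate(model, src)) for src in informal_only]
        # Next round trains on the original pairs plus the synthetic pairs.
        train_data = list(parallel_pairs) + synthetic
    return model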
