Ancient Korean Neural Machine Translation

Translations of ancient languages can supply source material for digital media and can inform fields as diverse as natural phenomena, medicine, and science. These needs have driven a global effort to translate ancient texts, but the work requires trained experts, who are scarce and slow to train, and, more importantly, manual translation itself is a slow process. Machine translation has therefore recently been investigated for recovering ancient texts, yet there is no prior literature on machine translation of ancient Korean. This paper proposes the first ancient Korean neural machine translation model, built on the Transformer. The model can improve a translator's efficiency by quickly providing draft translations for the many still-untranslated ancient documents. Furthermore, a new subword tokenization method, Share Vocabulary and Entity Restriction Byte Pair Encoding, is proposed based on the characteristics of ancient Korean sentences. It outperforms conventional subword tokenization methods such as byte pair encoding by 5.25 BLEU points, and decoding strategies such as n-gram blocking and model ensembles improve performance by a further 2.89 BLEU points. The model has been released publicly as a software application.
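
The tokenizer is only summarized at a high level here, but its core idea, a single subword vocabulary shared between the ancient-Korean source and the modern-Korean target, with named entities protected from being split, can be sketched with the SentencePiece library. The file name, entity list, and vocabulary size below are illustrative assumptions, not the authors' settings.

```python
# Sketch: shared-vocabulary BPE with protected entity tokens, in the
# spirit of the proposed Share Vocabulary and Entity Restriction BPE.
# Paths, entities, and hyperparameters are illustrative assumptions,
# not the paper's actual configuration.
import sentencepiece as spm

# Hypothetical file holding ancient-Korean source lines and modern-Korean
# target lines concatenated, so one BPE model (and hence one vocabulary)
# is shared by both sides of the translation pair.
CORPUS = "ancient_plus_modern.txt"

# Hypothetical named entities (people, places, era names) that should
# survive tokenization as single units rather than being split apart.
ENTITIES = ["세종", "한양", "훈민정음"]

spm.SentencePieceTrainer.train(
    input=CORPUS,
    model_prefix="shared_bpe",
    vocab_size=8000,                # assumed size
    model_type="bpe",
    user_defined_symbols=ENTITIES,  # entities kept atomic
)

sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
print(sp.encode("세종이 한양에 행차하였다", out_type=str))
```

Training a joint BPE model on the concatenated source and target corpora gives both sides the same vocabulary, and `user_defined_symbols` is one way to keep entity strings atomic during segmentation.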

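The n-gram blocking mentioned among the decoding strategies can likewise be illustrated with a short standalone check: before a candidate token is appended to a hypothesis, the decoder verifies that doing so would not repeat an n-gram already generated. This is a generic sketch of the technique, not the paper's implementation; the function name and the choice of n = 3 are assumptions.

```python
# Sketch of n-gram blocking during decoding: a candidate token is
# rejected if appending it would recreate an n-gram that already
# occurs in the hypothesis. Generic illustration, not the paper's code.
from typing import List


def violates_ngram_block(hyp: List[str], candidate: str, n: int = 3) -> bool:
    """Return True if appending `candidate` to `hyp` would create
    an n-gram that already appears in `hyp`."""
    if len(hyp) < n - 1:
        return False  # too short to form a full n-gram yet
    new_ngram = tuple(hyp[-(n - 1):] + [candidate])
    seen = {tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)}
    return new_ngram in seen


# Example: with 3-gram blocking, re-generating "왕 이 말" is rejected.
hyp = ["왕", "이", "말", "하", "기", "를", "왕", "이"]
print(violates_ngram_block(hyp, "말", n=3))  # True: "왕 이 말" repeats
```

Ensemble decoding, the other strategy mentioned above, typically averages per-step log-probabilities from several independently trained models before checks like this one are applied to the combined scores.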