Hybridized Character-Word Embedding for Korean Traditional Document Translation

Translating traditional documents is laborious and time-consuming for human translators owing to their voluminous nature and the complexity of their grammatical patterns. Recently, neural machine translation architectures such as the sequence-to-sequence (seq2seq) model have shown superior translation performance. However, seq2seq models suffer from the out-of-vocabulary (OOV) problem when dealing with languages that have large and complex vocabularies, such as Chinese characters, resulting in degraded performance. To cope with the OOV problem, we propose a new method that combines word embeddings and character embeddings, using the character embeddings to compensate for the information lost on unknown words. Experimental results show that the proposed method is effective for translating old Korean archives written in Chinese characters (Hanja) into modern Korean (Hangul) documents.
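The core idea of the hybrid embedding can be illustrated with a minimal sketch: a word found in the word-level vocabulary uses its learned word embedding, while an OOV word falls back to a composition (here, the mean) of its character embeddings. All names, dimensions, and the averaging rule below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

DIM = 4  # embedding dimensionality (illustrative)

rng = np.random.default_rng(0)
# Toy vocabularies: one known Hanja word, plus per-character embeddings.
word_vocab = {"王": rng.normal(size=DIM)}
char_vocab = {c: rng.normal(size=DIM) for c in "王命世子"}

def hybrid_embed(word: str) -> np.ndarray:
    """Look up a word embedding, backing off to character embeddings for OOV."""
    if word in word_vocab:
        return word_vocab[word]
    # OOV word: average the embeddings of its known characters.
    chars = [char_vocab[c] for c in word if c in char_vocab]
    if not chars:
        return np.zeros(DIM)  # entirely unknown: zero vector
    return np.mean(chars, axis=0)

print(hybrid_embed("王").shape)    # in-vocabulary word
print(hybrid_embed("世子").shape)  # OOV word composed from characters
```

In a seq2seq setting, the vector produced by such a back-off would feed the encoder in place of a single shared `<unk>` embedding, so OOV words retain character-level information instead of collapsing to one token.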
