Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation

Understanding voluminous historical records provides clues about the past in various aspects, ranging from social and political issues to natural science facts. However, it is generally difficult to fully utilize such records, since most of the documents are not written in a modern language and parts of their contents have been damaged over time. Consequently, restoring the damaged or unrecognizable parts and translating the records into modern languages are both crucial tasks. In response, we present a multi-task learning approach based on a self-attention mechanism that jointly restores and translates historical documents, specifically utilizing two Korean historical records that are among the most voluminous historical records in the world. Experimental results show that our approach significantly improves the accuracy of the translation task compared to baselines without multi-task learning. In addition, we present an in-depth exploratory analysis of the translated results via topic modeling, uncovering several significant historical events.
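
The abstract outlines the core idea but not the implementation. The following is a minimal sketch of how such a multi-task setup could look, assuming a shared Transformer encoder with a token-restoration head and a translation decoder trained under a summed loss; the class name, model sizes, and padding index are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class RestoreAndTranslate(nn.Module):
    """Shared encoder, two heads: restore damaged source tokens, translate to modern text.
    Positional encodings are omitted for brevity."""
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.decoder = nn.TransformerDecoder(dec, layers)
        self.restore_head = nn.Linear(d_model, src_vocab)    # fills in masked/damaged tokens
        self.translate_head = nn.Linear(d_model, tgt_vocab)  # emits modern-language tokens

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.src_emb(src_ids))          # shared representation
        restore_logits = self.restore_head(memory)
        t = tgt_ids.size(1)                                   # causal mask for the decoder
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        dec_out = self.decoder(self.tgt_emb(tgt_ids), memory, tgt_mask=causal)
        return restore_logits, self.translate_head(dec_out)

model = RestoreAndTranslate(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(1, 8000, (2, 16))  # historical text, damaged spans replaced by a mask id
tgt = torch.randint(1, 8000, (2, 20))  # modern-language reference translation
restore_logits, trans_logits = model(src, tgt[:, :-1])
ce = nn.CrossEntropyLoss(ignore_index=0)  # 0 assumed to be the padding index
loss = ce(restore_logits.transpose(1, 2), src) \
     + ce(trans_logits.transpose(1, 2), tgt[:, 1:])  # simple summed multi-task loss
loss.backward()
```

Summing the two cross-entropy terms is the simplest joint objective; a weighted sum that tunes the trade-off between restoration and translation quality is a common alternative.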

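The exploratory analysis step could be sketched in a similar spirit. The abstract does not fix the topic-modeling algorithm, so the example below assumes non-negative matrix factorization (NMF) over TF-IDF vectors of the machine-translated text; the three input strings are placeholders standing in for the real translated corpus.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

translated_docs = [
    "a great comet appeared in the northern sky for ten nights",
    "heavy storms destroyed the rice harvest in the southern provinces",
    "the king held court to discuss the border garrisons",
]  # placeholder strings; the real input is the machine-translated record corpus

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(translated_docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
doc_topic = nmf.fit_transform(X)          # document-topic weight matrix
terms = tfidf.get_feature_names_out()
for k, comp in enumerate(nmf.components_):
    top = comp.argsort()[-5:][::-1]       # five highest-weighted terms per topic
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```

Inspecting the top-weighted terms per topic across reigns or decades is one way such an analysis could surface the kinds of historical events the abstract alludes to, such as astronomical sightings or storm damage.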