Summarising Historical Text in Modern Languages

We introduce the task of historical text summarisation, in which documents written in historical forms of a language are summarised in the corresponding modern language. This task is of fundamental importance to historians and digital humanities researchers, but it has never been automated. We compile a high-quality, gold-standard text summarisation dataset consisting of historical German and Chinese news from hundreds of years ago, summarised in modern German and Chinese. Building on cross-lingual transfer learning techniques, we propose a summarisation model that can be trained even without cross-lingual (historical-to-modern) parallel data, and we benchmark it against state-of-the-art algorithms. We report automatic and human evaluations that distinguish historical-to-modern summarisation from standard cross-lingual summarisation (i.e., modern to modern language), highlight the distinctness and value of our dataset, and demonstrate that our transfer learning approach outperforms standard cross-lingual benchmarks on this task.
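
As a concrete illustration of the kind of transfer such an approach can rely on, the sketch below maps historical-language word embeddings into the modern-language embedding space with orthogonal Procrustes, using only a small seed lexicon of historical/modern word pairs; a summariser trained purely on modern-language data could then consume projected historical input. This is a minimal sketch under assumed inputs (random vectors stand in for real embeddings, and `align_embeddings` is a hypothetical helper), not the paper's actual model.

```python
# Minimal sketch (not the paper's implementation): align historical-language
# word embeddings with modern-language embeddings via orthogonal Procrustes,
# a common building block of cross-lingual transfer when no
# historical-to-modern parallel data is available.
import numpy as np

def align_embeddings(hist_vecs: np.ndarray, mod_vecs: np.ndarray) -> np.ndarray:
    """Return an orthogonal map W such that hist_vecs @ W approximates mod_vecs.

    hist_vecs, mod_vecs: (n, d) arrays of embeddings for n seed word pairs
    (e.g. historical spellings paired with their modern forms).
    """
    # Orthogonal Procrustes solution: W = U V^T for the SVD of hist^T mod.
    u, _, vt = np.linalg.svd(hist_vecs.T @ mod_vecs)
    return u @ vt

# Hypothetical usage: random stand-ins for a 500-pair seed lexicon of
# 300-dimensional embeddings.
rng = np.random.default_rng(0)
hist = rng.normal(size=(500, 300))  # historical-language seed embeddings
mod = rng.normal(size=(500, 300))   # modern-language seed embeddings
W = align_embeddings(hist, mod)
# Any historical word vector v can now be projected as v @ W and passed to a
# summariser trained only on modern-language data (zero-shot transfer).
```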
