Improving Lemmatization of Non-Standard Languages with Joint Learning

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages, in which the task is made harder by an additional aspect (iii): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an encoder-decoder architecture, which we enrich with sentence context information using a hierarchical sentence encoder. We show significant improvements over the state of the art when training the sentence encoder jointly for lemmatization and language modeling. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. We also test the proposed model on a set of typologically diverse standard languages, where results are on par with or better than both a model without enhanced sentence representations and previous state-of-the-art systems. Finally, to encourage future work on the processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, which is based on openly accessible sources.
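As a rough illustration of the joint setup described above, the following minimal sketch shows how a single annotated sentence could be turned into training examples that serve both tasks: a character-level transduction target (the lemma) and language-modeling targets (the neighboring tokens). It is not the paper's implementation. The names `JointExample`, `build_examples` and `joint_loss`, the boundary symbols, the use of both a forward and a backward language-modeling target, and the weight `lm_weight` are all assumptions made for illustration; the abstract only states that lemmatization and language modeling are trained jointly.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical boundary symbols; the abstract does not specify any.
BOS, EOS = "<bos>", "<eos>"


@dataclass
class JointExample:
    source_chars: List[str]  # characters of the (possibly non-standard) token
    target_chars: List[str]  # characters of the lemma the decoder should emit
    next_token: str          # forward language-modeling target (assumption)
    prev_token: str          # backward language-modeling target (assumption)


def build_examples(tokens: List[str], lemmas: List[str]) -> List[JointExample]:
    """Pair every token with its lemma and with its sentence neighbors."""
    if len(tokens) != len(lemmas):
        raise ValueError("tokens and lemmas must be aligned one-to-one")
    examples = []
    for i, (token, lemma) in enumerate(zip(tokens, lemmas)):
        examples.append(
            JointExample(
                source_chars=list(token),
                target_chars=list(lemma) + [EOS],
                next_token=tokens[i + 1] if i + 1 < len(tokens) else EOS,
                prev_token=tokens[i - 1] if i > 0 else BOS,
            )
        )
    return examples


def joint_loss(lemma_loss: float, lm_loss: float, lm_weight: float = 0.2) -> float:
    """Weighted sum of the two task losses; the weighting scheme is an assumption."""
    return lemma_loss + lm_weight * lm_loss


# Usage with an invented two-token sentence containing a spelling variant.
examples = build_examples(["dat", "huus"], ["dat", "huis"])
```

Note that no POS or morphological labels appear in these examples: the only supervision beyond the lemma is the surrounding text itself, which is what makes a language-modeling auxiliary task attractive for sparsely annotated historical corpora.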
