Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks

In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and a morphosyntactic context representation. Our context-sensitive lemmatizer generates the lemma one character at a time from the characters of the surface form and the word's morphosyntactic features, which are obtained from a morphological tagger. We argue that a sliding-window context representation suffers from sparseness, whereas in the majority of cases the morphosyntactic features of a word carry enough information to resolve lemma ambiguities while keeping the context representation dense and more practical for machine learning systems. Additionally, we study two data augmentation methods, one based on autoencoder training and one on morphological transducers, which are especially beneficial for low-resource languages. We evaluate our lemmatizer on 52 languages and 76 treebanks, showing that our system outperforms all of the latest baseline systems. Against the best overall baseline, UDPipe Future, our system performs better on 62 of the 76 treebanks, reducing errors by 19% relative on average. The lemmatizer, together with all trained models, is made available as part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.
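To make the input representation concrete, the following is a minimal sketch of how a source sequence for such a character-level sequence-to-sequence lemmatizer could be assembled. The abstract only states that the input combines the surface-form characters with morphosyntactic features from a tagger; the function names, the `UPOS=` prefix, and the use of the CoNLL-U `|`-separated feature string are assumptions made for illustration, not the paper's confirmed format.

```python
from typing import List


def build_input_sequence(word: str, upos: str, feats: str) -> List[str]:
    """Hypothetical helper: word characters followed by tag tokens.

    Each character is one token, and each morphosyntactic tag is one
    token, so the model sees a dense representation of the context
    instead of a window of neighbouring words.
    """
    chars = list(word)
    tags = [f"UPOS={upos}"]  # assumed tag format
    # "_" marks an empty feature column in CoNLL-U files.
    if feats and feats != "_":
        tags.extend(feats.split("|"))
    return chars + tags


def build_target_sequence(lemma: str) -> List[str]:
    """Hypothetical helper: the lemma is generated one character at a time."""
    return list(lemma)


# The same surface form receives different inputs depending on its tags,
# which is what lets the model choose between the lemmas "leaf" and "leave".
noun_src = build_input_sequence("leaves", "NOUN", "Number=Plur")
verb_src = build_input_sequence(
    "leaves", "VERB", "Number=Sing|Person=3|Tense=Pres"
)

print(" ".join(noun_src), "->", " ".join(build_target_sequence("leaf")))
print(" ".join(verb_src), "->", " ".join(build_target_sequence("leave")))
```

In this sketch the two inputs share the same six characters and differ only in the tag tokens, so the ambiguity is resolved without looking at the surrounding words directly.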
