Leveraging Inflection Tables for Stemming and Lemmatization

We present several methods for stemming and lemmatization based on discriminative string transduction. We exploit the paradigmatic regularity of semi-structured inflection tables to identify stems in an unsupervised manner with over 85% accuracy. Experiments on English, Dutch and German show that our stemmers substantially outperform Snowball and Morfessor, and approach the accuracy of a supervised model. Furthermore, the generated stems are more consistent than those annotated by experts. Our direct lemmatization model is more accurate than Morfette and Lemming on most datasets. Finally, we test our methods on the data from the shared task on morphological reinflection.

[1]  Grzegorz Kondrak,et al.  Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion , 2007, NAACL.

[2]  Alexander M. Fraser,et al.  Joint Lemmatization and Morphological Tagging with Lemming , 2015, EMNLP.

[3]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[4]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[5]  Josef van Genabith,et al.  Learning Morphology with Morfette , 2008, LREC.

[6]  Alexander M. Fraser,et al.  Labeled Morphological Segmentation with Semi-Markov Models , 2015, CoNLL.

[7]  John DeNero,et al.  Supervised Learning of Complete Morphological Paradigms , 2013, NAACL.

[8]  Mathias Creutz,et al.  INDUCING THE MORPHOLOGICAL LEXICON OF A NATURAL LANGUAGE FROM UNANNOTATED TEXT , 2005 .

[9]  John A. Goldsmith,et al.  Unsupervised Learning of the Morphology of a Natural Language , 2001, CL.

[10]  Mikko Kurimo,et al.  Painless Semi-Supervised Morphological Segmentation using Conditional Random Fields , 2014, EACL.

[11]  Kristina Toutanova,et al.  A global model for joint lemmatization and part-of-speech prediction , 2009, ACL.

[12]  Hinrich Schütze,et al.  Efficient Higher-Order CRFs for Morphological Tagging , 2013, EMNLP.

[13]  Mathias Creutz,et al.  Induction of a Simple Morphology for Highly-Inflecting Languages , 2004, SIGMORPHON@ACL.

[14]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[15]  Ryan Cotterell,et al.  The SIGMORPHON 2016 Shared Task—Morphological Reinflection , 2016, SIGMORPHON.

[16]  Koby Crammer,et al.  Online Large-Margin Training of Dependency Parsers , 2005, ACL.

[17]  Mikko Kurimo,et al.  Morfessor FlatCat: An HMM-Based Method for Unsupervised and Semi-Supervised Learning of Morphology , 2014, COLING.

[18]  Richard Johansson,et al.  The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages , 2009, CoNLL Shared Task.

[19]  Mathias Creutz,et al.  Unsupervised Discovery of Morphemes , 2002, SIGMORPHON.

[20]  Hoifung Poon,et al.  Unsupervised Morphological Segmentation with Log-Linear Models , 2009, NAACL.

[21]  Grzegorz Kondrak,et al.  Integrating Joint n-gram Features into a Discriminative Training Framework , 2010, HLT-NAACL.