Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model

We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and it discovers inflectional principles (a grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50 to 100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.
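The "infinite mixture" assumption can be made concrete with a small sketch of the Chinese restaurant process, a standard way to sample cluster assignments under a Dirichlet process prior. This is only an illustration and not the paper's inference procedure: the function name `crp_assignments`, the concentration parameter `alpha`, and the use of bare integers as cluster labels are assumptions made for this example. In the setting described above, each cluster would stand for a paradigm and each customer for a word token.

```python
import random


def crp_assignments(num_tokens, alpha, seed=0):
    """Sample cluster assignments for num_tokens items from a
    Chinese restaurant process with concentration alpha > 0.

    Hypothetical helper for illustration: clusters are plain integers,
    whereas in a morphology model each would be a paradigm.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of tokens already in cluster k
    assignments = []   # assignments[i] = cluster index chosen for token i
    for _ in range(num_tokens):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = crp_assignments(num_tokens=1000, alpha=1.0)
    print("clusters used:", len(counts))
    print("largest cluster sizes:", sorted(counts, reverse=True)[:5])
```

The number of clusters is not fixed in advance; it grows with the data, and popular clusters attract more tokens. That rich-get-richer behavior is what lets a mixture of this kind accommodate both frequent and rare paradigms.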
