Unsupervised Learning of the Morphology of a Natural Language

This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly build a probabilistic morphological grammar, and use MDL as our primary tool to determine whether the modifications proposed by the heuristics will be adopted. The resulting grammar closely matches the analysis that a human morphologist would develop. In the final section, we discuss the relationship of this style of MDL grammatical analysis to the notion of an evaluation metric in early generative grammar.
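As a rough illustration of how MDL can arbitrate between a current grammar and a modification proposed by a heuristic, the sketch below compares total description length (bits to spell out the morpheme lexicon plus bits to encode the corpus given that lexicon) before and after a proposed stem/suffix split. It is a toy under stated assumptions, not the paper's actual formulation: the flat per-letter cost, the single unigram distribution over morphemes, and the names `model_cost`, `data_cost`, `description_length`, and `accept` are all hypothetical choices made for this example.

```python
import math
from collections import Counter

# Assumption: every letter costs the same number of bits (26-letter alphabet).
BITS_PER_LETTER = math.log2(26)

def model_cost(morphemes):
    """Bits to spell out each distinct morpheme, plus one end marker each."""
    return sum((len(m) + 1) * BITS_PER_LETTER for m in morphemes)

def data_cost(analyses):
    """Bits to encode the corpus, given maximum-likelihood morpheme probabilities.

    `analyses` holds one tuple of morphemes per word token.
    Assumption: a single unigram distribution over all morphemes.
    """
    counts = Counter(m for analysis in analyses for m in analysis)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def description_length(analyses):
    """Total description length: lexicon cost plus corpus cost."""
    morphemes = {m for analysis in analyses for m in analysis}
    return model_cost(morphemes) + data_cost(analyses)

def accept(current, proposed):
    """Adopt the proposed analysis only if it shortens the description."""
    return description_length(proposed) < description_length(current)

# Toy corpus: three stems, each occurring with four suffixes (one is empty).
stems = ["jump", "walk", "talk"]
suffixes = ["", "s", "ed", "ing"]

# Current grammar: every word form is an unanalyzed whole.
unsegmented = [(stem + suffix,) for stem in stems for suffix in suffixes]
# Proposed modification: split every word form into stem + suffix.
segmented = [(stem, suffix) for stem in stems for suffix in suffixes]

if __name__ == "__main__":
    print("unsegmented bits:", round(description_length(unsegmented), 1))
    print("segmented bits:  ", round(description_length(segmented), 1))
    print("adopt the split? ", accept(unsegmented, segmented))
```

On this toy corpus the split is adopted because sharing three stems and four suffixes makes the lexicon much cheaper to spell out than twelve whole word forms, and that saving outweighs the extra bits needed to encode two morphemes per word.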
