Representational Bias in Unsupervised Learning of Syllable Structure

Unsupervised learning algorithms based on Expectation Maximization (EM) are often straightforward to implement and provably converge to a local maximum of the likelihood. In practice, however, they frequently perform poorly. Common wisdom holds that this is because they are overly sensitive to initial parameter values and easily get stuck in local (but not global) maxima. We present a series of experiments indicating that, for the task of learning syllable structure, the initial parameter weights are not crucial; rather, it is the choice of the model class itself that makes the difference between successful and unsuccessful learning. We use a language-universal rule-based algorithm to find a good set of parameters, and then train the parameter weights using EM. We achieve word accuracy of 95.9% on German and 97.1% on English, compared to 97.4% and 98.1%, respectively, for supervised training.
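The EM setup described above can be illustrated on a toy sub-problem. The sketch below is not the paper's model: the vowel/consonant split, the single parameter per consonant (its probability of being syllabified as an onset), and the sample words are all illustrative assumptions. The E-step computes a posterior over split points inside each intervocalic consonant cluster; the M-step re-estimates the per-consonant onset probabilities from those expected counts.

```python
from collections import defaultdict

VOWELS = set("aeiou")  # simplifying assumption for the toy example

def em_syllabify(words, n_iters=20):
    """Toy EM for syllable-boundary placement.

    Each word is assumed to contain one intervocalic consonant cluster;
    a split point k assigns the first k consonants to the coda of the
    preceding syllable and the rest to the onset of the following one.
    theta[c] = P(consonant c is syllabified as an onset).
    """
    clusters = [[ch for ch in w if ch not in VOWELS] for w in words]
    theta = defaultdict(lambda: 0.5)  # uniform initialization

    for _ in range(n_iters):
        onset_exp = defaultdict(float)  # expected onset occurrences
        total_exp = defaultdict(float)  # expected total occurrences

        # E-step: posterior over split points for each cluster.
        for cluster in clusters:
            scores = []
            for k in range(len(cluster) + 1):
                s = 1.0
                for c in cluster[:k]:
                    s *= 1.0 - theta[c]  # coda probability
                for c in cluster[k:]:
                    s *= theta[c]        # onset probability
                scores.append(s)
            z = sum(scores)
            if z == 0.0:
                continue
            for k, s in enumerate(scores):
                p = s / z
                for c in cluster[:k]:
                    total_exp[c] += p
                for c in cluster[k:]:
                    onset_exp[c] += p
                    total_exp[c] += p

        # M-step: re-estimate onset probabilities from expected counts.
        for c, tot in total_exp.items():
            if tot > 0.0:
                theta[c] = onset_exp[c] / tot

    return dict(theta)
```

On words like "atra" and "okra", the cluster-final "r" accumulates more expected onset mass than the cluster-initial "t" and "k", so EM drives its onset probability higher; the guaranteed convergence is still only to a local optimum, which is the behavior the experiments above probe.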
