Why Doesn’t EM Find Good HMM POS-Taggers?

This paper investigates why Hidden Markov Models (HMMs) estimated by Expectation-Maximization (EM) perform so poorly as Part-of-Speech (POS) taggers. We find that HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, whereas the empirical distribution of tokens over POS tags is highly skewed. This motivates a Bayesian approach that uses a sparse prior to bias the estimator toward such a skewed distribution. We investigate Gibbs Sampling (GS) and Variational Bayes (VB) estimators and show that VB converges faster than GS for this task and that VB significantly improves 1-to-1 tagging accuracy over EM. We also show that EM does nearly as well as VB when the number of hidden HMM states is dramatically reduced. Finally, we point out that all of these estimators have high variance and that they require many more iterations to approach convergence than is usually thought.
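To make the effect of a sparse prior concrete, the sketch below contrasts the EM M-step for a single multinomial (plain normalization of expected counts) with the standard mean-field VB update under a symmetric Dirichlet prior, which passes the counts through exp(digamma(.)) before normalizing. It is an illustration under stated assumptions, not the paper's implementation: the function names, the hyperparameter name `alpha`, and the example counts are all hypothetical, and `digamma` is a small hand-written approximation so that the block needs only the standard library.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0.

    Uses the recurrence psi(x) = psi(x + 1) - 1/x to shift x up to at
    least 6, then an asymptotic series. Illustrative helper only.
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def em_m_step(counts):
    """EM M-step: normalize expected counts into probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def vb_m_step(counts, alpha):
    """Mean-field VB update with a symmetric Dirichlet(alpha) prior.

    Returns sub-normalized weights (they sum to less than 1). With a
    small alpha, outcomes with small expected counts are discounted
    much more heavily than outcomes with large counts.
    """
    total = sum(counts) + alpha * len(counts)
    denom = math.exp(digamma(total))
    return [math.exp(digamma(c + alpha)) / denom for c in counts]


# Hypothetical expected counts for three outcomes of one multinomial.
counts = [50.0, 5.0, 0.5]
em_probs = em_m_step(counts)
vb_weights = vb_m_step(counts, alpha=0.1)
```

Comparing `vb_weights[2] / vb_weights[0]` with `em_probs[2] / em_probs[0]` shows the rare outcome losing relative mass under the VB update, which is the sense in which a sparse prior pushes the estimate toward a skewed distribution.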
