Modelling the Lexicon in Unsupervised Part of Speech Induction

Automatically inducing the syntactic part-of-speech categories for words in text is a fundamental task in Computational Linguistics. While the performance of unsupervised tagging models has been slowly improving, current state-of-the-art systems make the obviously incorrect assumption that all tokens of a given word type must share a single part-of-speech tag. This one-tag-per-type heuristic counters the tendency of Hidden Markov Model based taggers to over generate tags for a given word type. However, it is clearly incompatible with basic syntactic theory. In this paper we extend a state-of-the-art Pitman-Yor Hidden Markov Model tagger with an explicit model of the lexicon. In doing so we are able to incorporate a soft bias towards inducing few tags per type. We develop a particle filter for drawing samples from the posterior of our model and present empirical results that show that our model is competitive with and faster than the state-of-the-art without making any unrealistic restrictions.

[1]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[2]  Jianfeng Gao,et al.  A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers , 2008, EMNLP.

[3]  Bernard Mérialdo,et al.  Tagging English Text with a Probabilistic Model , 1994, CL.

[4]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[5]  Regina Barzilay,et al.  Simple Type-Level Unsupervised POS Tagging , 2010, EMNLP.

[6]  Phil Blunsom,et al.  Unsupervised Bayesian Part of Speech Inference with Particle Gibbs , 2012, ECML/PKDD.

[7]  Yee Whye Teh,et al.  A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes , 2006, ACL.

[8]  Mark Steedman,et al.  Two Decades of Unsupervised POS Induction: How Far Have We Come? , 2010, EMNLP.

[9]  Slav Petrov,et al.  Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections , 2011, ACL.

[10]  Julian M. Kupiec,et al.  Robust part-of-speech tagging using a hidden Markov model , 1992 .

[11]  A. Doucet,et al.  Particle Markov chain Monte Carlo methods , 2010 .

[12]  Phil Blunsom,et al.  A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction , 2011, ACL.

[13]  Thomas L. Griffiths,et al.  A fully Bayesian approach to unsupervised part-of-speech tagging , 2007, ACL.

[14]  Mark Johnson,et al.  A Bayesian LDA-based model for semi-supervised part-of-speech tagging , 2007, NIPS.

[15]  Mark Johnson,et al.  Why Doesn’t EM Find Good HMM POS-Taggers? , 2007, EMNLP.

[16]  Thomas L. Griffiths,et al.  Interpolating between types and tokens by estimating power-law generators , 2005, NIPS.

[17]  Ben Taskar,et al.  Posterior Regularization for Structured Latent Variable Models , 2010, J. Mach. Learn. Res..

[18]  John DeNero,et al.  Painless Unsupervised Learning with Features , 2010, NAACL.

[19]  Dan Klein,et al.  Type-Based MCMC , 2010, HLT-NAACL.

[20]  Ben Taskar,et al.  Wiki-ly Supervised Part-of-Speech Tagging , 2012, EMNLP.

[21]  Alexander Clark,et al.  Combining Distributional and Morphological Information for Part of Speech Induction , 2003, EACL.

[22]  Sabine Buchholz,et al.  CoNLL-X Shared Task on Multilingual Dependency Parsing , 2006, CoNLL.