Paper Information - Modelling the Lexicon in Unsupervised Part of Speech Induction

Modelling the Lexicon in Unsupervised Part of Speech Induction

Automatically inducing the syntactic part-of-speech categories for words in text is a fundamental task in Computational Linguistics. While the performance of unsupervised tagging models has been slowly improving, current state-of-the-art systems make the obviously incorrect assumption that all tokens of a given word type must share a single part-of-speech tag. This one-tag-per-type heuristic counters the tendency of Hidden Markov Model-based taggers to overgenerate tags for a given word type. However, it is clearly incompatible with basic syntactic theory. In this paper we extend a state-of-the-art Pitman-Yor Hidden Markov Model tagger with an explicit model of the lexicon. In doing so we are able to incorporate a soft bias towards inducing few tags per type. We develop a particle filter for drawing samples from the posterior of our model and present empirical results that show that our model is competitive with, and faster than, the state-of-the-art without making any unrealistic restrictions.

Phil Blunsom | Gregory Dubbin
