Hierarchical clustering of word class distributions

We propose an unsupervised approach to POS tagging. First, we associate each word type with a probability distribution over word classes using Latent Dirichlet Allocation. We then build a hierarchical clustering of the word types with an agglomerative algorithm, where the distance between clusters is the Jensen-Shannon divergence between the class distributions associated with each word type. To assign a POS tag, we find the tree leaf most similar to the current word and use a prefix of the root-to-leaf path as the tag. This simple labeler outperforms a baseline based on Brown clusters on 9 out of 10 datasets.
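The pipeline described above can be sketched roughly as follows. The word list and the `theta` matrix standing in for LDA's per-word class distributions are toy assumptions, as is the choice of average linkage and a one-bit tag prefix; note also that SciPy's `jensenshannon` returns the square root of the JS divergence (a proper metric), which serves equally well as a clustering distance.

```python
# Sketch: hypothetical LDA class distributions, agglomerative clustering
# under a Jensen-Shannon distance, and a tag read off as a prefix of the
# root-to-leaf path in the resulting tree.
import numpy as np
from scipy.spatial.distance import pdist, jensenshannon
from scipy.cluster.hierarchy import linkage, to_tree

# Toy stand-in for LDA output: each row is one word type's probability
# distribution over word classes (rows sum to 1).
words = ["dog", "cat", "run", "jump"]
theta = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
])

# Pairwise Jensen-Shannon distances between the class distributions.
dists = pdist(theta, metric=jensenshannon)

# Agglomerative clustering (average linkage here) over those distances,
# converted to a binary tree whose leaves are the word types.
tree = to_tree(linkage(dists, method="average"))

def path_to_leaf(node, leaf_id, prefix=""):
    """Return the 0/1 root-to-leaf path string for a leaf, or None."""
    if node.is_leaf():
        return prefix if node.id == leaf_id else None
    return (path_to_leaf(node.left, leaf_id, prefix + "0")
            or path_to_leaf(node.right, leaf_id, prefix + "1"))

# A prefix of the path acts as the word's tag; a shorter prefix yields
# a coarser tagset.
for i, w in enumerate(words):
    path = path_to_leaf(tree, i)
    print(w, path, path[:1])
```

In this toy example the noun-like words ("dog", "cat") and the verb-like words ("run", "jump") end up in opposite subtrees, so already the one-character path prefix separates the two groups.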
