A Bayesian Mixture Model for PoS Induction Using Multiple Features

In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., an HMM), we can easily add multiple kinds of features, both at the type level (morphology features) and at the token level (context and alignment features, the latter from parallel corpora). Using only context features, our system yields results comparable to the state of the art and far better than a similar model without the one-class-per-type constraint. The additional features provide further benefit, and our final system outperforms the best published results on most of the 25 corpora tested.
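To make the one-class-per-type idea concrete, the following is a minimal sketch of a type-level multinomial mixture, not the system described above. It assumes a single feature kind (context counts per word type), symmetric Dirichlet priors, and collapsed Gibbs sampling for inference; none of these details are given in the abstract. The names `gibbs_cluster`, `log_dirmult_ratio`, `alpha`, `beta`, and the toy feature ids are all hypothetical.

```python
import math
import random


def log_dirmult_ratio(class_counts, class_total, type_counts, beta, num_features):
    """Log predictive score of one type's feature counts under a class.

    Dirichlet-multinomial ratio with a symmetric Dirichlet(beta) prior.
    The multinomial coefficient is omitted: it is the same for every class.
    """
    n_total = sum(type_counts.values())
    score = math.lgamma(class_total + num_features * beta)
    score -= math.lgamma(class_total + n_total + num_features * beta)
    for feat, n in type_counts.items():
        c = class_counts.get(feat, 0)
        score += math.lgamma(c + n + beta) - math.lgamma(c + beta)
    return score


def gibbs_cluster(type_features, num_classes, num_features,
                  alpha=0.1, beta=0.1, iters=50, seed=0):
    """Assign each word type to exactly one class (assumed collapsed Gibbs sampler)."""
    rng = random.Random(seed)
    types = sorted(type_features)
    z = {w: rng.randrange(num_classes) for w in types}  # one class per type
    sizes = [0] * num_classes                  # number of types in each class
    counts = [dict() for _ in range(num_classes)]  # feature counts per class
    totals = [0] * num_classes                 # total feature count per class

    def update(w, k, sign):
        # Add (sign=+1) or remove (sign=-1) all of type w's counts from class k.
        sizes[k] += sign
        for feat, n in type_features[w].items():
            counts[k][feat] = counts[k].get(feat, 0) + sign * n
            totals[k] += sign * n

    for w in types:
        update(w, z[w], +1)

    for _ in range(iters):
        for w in types:
            update(w, z[w], -1)  # hold out this type before resampling it
            logp = [
                math.log(sizes[k] + alpha)
                + log_dirmult_ratio(counts[k], totals[k],
                                    type_features[w], beta, num_features)
                for k in range(num_classes)
            ]
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]  # stable normalisation
            z[w] = rng.choices(range(num_classes), weights=weights)[0]
            update(w, z[w], +1)
    return z


# Toy data (invented for illustration): context-feature counts per word type.
toy = {
    "the": {"R=dog": 3, "R=cat": 2},
    "a":   {"R=dog": 2, "R=cat": 3},
    "dog": {"L=the": 3, "L=a": 2},
    "cat": {"L=the": 2, "L=a": 3},
}
num_feats = len({f for feats in toy.values() for f in feats})
assignment = gibbs_cluster(toy, num_classes=2, num_features=num_feats)
```

Because the class variable is attached to the word type rather than to each token, all tokens of a type move between classes together. Further feature kinds (morphology, alignments) could in principle be added as extra factors in the per-class score, which is the flexibility the abstract attributes to the mixture formulation.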
