A Simple Unsupervised Learner for POS Disambiguation Rules Given Only a Minimal Lexicon

We propose a new model for unsupervised POS tagging based on linguistic distinctions between open- and closed-class items. Exploiting notions from current linguistic theory, the system uses far less information than previous systems, far simpler computational methods, and far sparser descriptions in learning contexts. Given only the closed-class lexicon, the system applies simple counting-based language acquisition techniques to acquire a large open-class lexicon and then disambiguation rules for both. The system achieves a 20% error reduction for POS tagging over state-of-the-art unsupervised systems tested under the same conditions, and comparable accuracy when trained with much less prior information.
