Guiding Unsupervised Grammar Induction Using Contrastive Estimation

We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation (CE) [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and subsumes a wide class of likelihood-based objective functions. This criterion generalizes the function maximized by the Expectation-Maximization (EM) algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MATCHLINGUIST task). The selection of an implicit negative evidence class (a "neighborhood") appropriate to a given task has strong implications, but with a well-chosen neighborhood one can tailor the objective of grammar induction to a specific application.
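As a sketch of the criterion (following the formulation in Smith and Eisner [2005], with hidden structures y, feature vector f, weights θ, and a modeler-chosen neighborhood function N; the notation here is illustrative, not quoted from the paper):

```latex
% Contrastive estimation: each observed sentence x_i must "beat"
% the perturbed strings in its neighborhood N(x_i), which serve
% as implicit negative evidence.
\max_{\vec{\theta}} \;
  \sum_{i} \log
  \frac{\sum_{y \in \mathcal{Y}(x_i)}
          \exp\bigl(\vec{\theta} \cdot \vec{f}(x_i, y)\bigr)}
       {\sum_{x' \in N(x_i)} \; \sum_{y' \in \mathcal{Y}(x')}
          \exp\bigl(\vec{\theta} \cdot \vec{f}(x', y')\bigr)}
```

Taking N(x_i) to be the set of all strings recovers ordinary joint maximum likelihood for the log-linear model, which is the sense in which CE subsumes likelihood-based objectives; smaller, task-targeted neighborhoods (for example, all transpositions of adjacent words in x_i) keep the denominator tractable while still forcing probability mass onto the observed sentence.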

[1] L. Baum et al. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, 1970.

[2] A. P. Dempster et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[3] J. Baker. Trainable grammars for speech recognition, 1979.

[4] Jorge Nocedal et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.

[5] Kenneth Ward Church et al. A Spelling Correction Program Based on a Noisy Channel Model, 1990, COLING.

[6] Steve Young et al. Applications of stochastic context-free grammars using the Inside-Outside algorithm, 1990.

[7] Fernando Pereira et al. Inside-Outside Reestimation From Partially Bracketed Corpora, 1992, HLT.

[8] Biing-Hwang Juang et al. Discriminative learning for minimum error classification [pattern recognition], 1992, IEEE Trans. Signal Process.

[9] Glenn Carroll et al. Two Experiments on Learning Probabilistic Dependency Grammars from Corpora, 1992.

[10] Eugene Charniak. Statistical language learning, 1997.

[11] Bernard Mérialdo. Tagging English Text with a Probabilistic Model, 1994, CL.

[12] David Yarowsky. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French, 1994, ACL.

[13] David Yarowsky. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, 1995, ACL.

[14] Eric Brill. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging, 1995, VLC@ACL.

[15] Beryl Hoffman. Computational analysis of the syntax and interpretation of "free" word order in Turkish, 1995.

[16] Dekai Wu. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora, 1997, CL.

[17] Steven P. Abney. Stochastic Attribute-Value Grammars, 1996, CL.

[18] Stefan Riezler. Probabilistic Constraint Logic Programming, 1997, ArXiv.

[19] Steve J. Young et al. MMIE training of large vocabulary recognition systems, 1997, Speech Communication.

[20] Ronald Rosenfeld et al. A survey of smoothing techniques for ME models, 2000, IEEE Trans. Speech Audio Process.

[21] Mark Johnson et al. Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training, 2000, ACL.

[22] Menno van Zaanen. Bootstrapping structure into language: alignment-based learning, 2001, ArXiv.

[23] Mark Johnson et al. Joint and Conditional Estimation of Tagging and Parsing Models, 2001, ACL.

[24] Koby Crammer et al. On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, 2002, J. Mach. Learn. Res.

[25] Dimitra Vergyri et al. Integration of multiple knowledge sources in speech recognition using minimum error training, 2001.

[26] Andrew McCallum et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[27] Dan Klein et al. Conditional Structure versus Conditional Estimation in NLP Models, 2002, EMNLP.

[28] Jun'ichi Tsujii et al. Maximum entropy estimation for feature forests, 2002.

[29] Dan Klein et al. A Generative Constituent-Context Model for Improved Grammar Induction, 2002, ACL.

[30] Alexander Clark. Unsupervised Language Acquisition: Theory and Practice, 2002, ArXiv.

[31] Thomas Hofmann et al. Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences, 2003, EMNLP.

[32] K. Lari et al. The estimation of stochastic context-free grammars using the Inside-Outside algorithm, 1990.

[33] Noah A. Smith et al. Bilingual Parsing with Factored Estimation: Using English to Parse Korean, 2004, EMNLP.

[34] Brian Roark et al. Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm, 2004, ACL.

[35] Dan Klein et al. Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency, 2004, ACL.

[36] Noah A. Smith et al. Annealing Techniques For Unsupervised Statistical Language Learning, 2004, ACL.

[37] Noah A. Smith et al. Dyna: a declarative language for implementing dynamic programs, 2004, ACL.

[38] Jonas Kuhn. Experiments in parallel-text based grammar induction, 2004, ACL.

[39] Michael Collins. Discriminative Reranking for Natural Language Parsing, 2000, CL.

[40] Noah A. Smith et al. Contrastive Estimation: Training Log-Linear Models on Unlabeled Data, 2005, ACL.