Proceedings of the Workshop on Prior Knowledge for Text and Language Processing

Making complex decisions in real-world problems often involves assigning values to sets of interdependent variables, where an expressive dependency structure among the variables can influence, or even dictate, which assignments are possible. Commonly used models typically ignore expressive dependencies because the traditional way of incorporating non-local dependencies is inefficient and hence leads to expensive training and inference. This paper presents Constrained Conditional Models (CCMs), a framework that augments probabilistic models with declarative constraints to support decisions in an expressive output space while keeping training modular and tractable. We develop, analyze, and compare novel algorithms for training and inference with CCMs. Our main experimental study demonstrates the advantage the framework provides when declarative constraints are used in supervised and semi-supervised training of a probabilistic model.
