Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning

Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures; they are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply to both the supervised and unsupervised settings. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization in this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that minimizing the empirical risk is NP-hard; we therefore suggest an approximate algorithm, similar to expectation-maximization, for minimizing it.
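To illustrate the supervised setting described above, the sketch below shows empirical risk minimization under log-loss for a probabilistic grammar when full derivations are observed. This is a minimal illustration, not the paper's algorithm: it assumes each derivation is given as a list of `(lhs, rhs)` rule applications, and it uses the standard fact that the log-loss empirical risk minimizer for a fixed-structure probabilistic grammar is the relative-frequency estimate of each rule given its left-hand side.

```python
from collections import defaultdict
import math

def erm_supervised(derivations):
    """Minimize empirical risk under log-loss in the supervised setting.

    Each derivation is a list of (lhs, rhs) rule applications. The
    minimizer has a closed form: the relative-frequency estimate
    p(lhs -> rhs) = count(lhs -> rhs) / count(lhs).
    """
    rule_counts = defaultdict(int)
    lhs_counts = defaultdict(int)
    for tree in derivations:
        for lhs, rhs in tree:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

def empirical_risk(theta, derivations):
    """Average log-loss: mean negative log-probability of a derivation."""
    total = 0.0
    for tree in derivations:
        total -= sum(math.log(theta[rule]) for rule in tree)
    return total / len(derivations)
```

In the unsupervised setting the derivations are latent, the analogous objective is nonconvex, and (as the abstract notes) minimizing the empirical risk is NP-hard; an EM-style procedure alternates between computing expected rule counts under the current parameters and re-normalizing them as above.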
