Structured Prediction of Sequences and Trees Using Infinite Contexts

Linguistic structures exhibit a rich array of global phenomena, yet commonly used Markov models cannot adequately describe these phenomena because of their strong locality assumptions. We propose a novel hierarchical model for structured prediction over sequences and trees that exploits global context by conditioning each generation decision on an unbounded context of prior decisions. This builds on the success of Markov models but, by not imposing a fixed context bound, better represents global phenomena. To make learning in this large, unbounded model tractable, we use a hierarchical Pitman-Yor process prior, which provides a recursive form of smoothing. We propose prediction algorithms based on A* search and Markov chain Monte Carlo sampling. Empirical results demonstrate the potential of our model relative to baseline finite-context Markov models on three tasks: morphological parsing, syntactic parsing, and part-of-speech tagging.
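
As a point of reference for the recursive smoothing mentioned above, the predictive distribution of a hierarchical Pitman-Yor process is commonly written in the following back-off form; the notation here is ours rather than the paper's: c_{uw} and t_{uw} are customer and table counts for symbol w in context u, c_{u.} and t_{u.} their totals, d_{|u|} and theta_{|u|} a per-depth discount and concentration, and pi(u) the back-off context obtained by dropping the most distant symbol.

\[
P(w \mid u) \;=\; \frac{c_{uw} - d_{|u|}\, t_{uw}}{\theta_{|u|} + c_{u\cdot}}
\;+\; \frac{\theta_{|u|} + d_{|u|}\, t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}}\, P\bigl(w \mid \pi(u)\bigr).
\]

The recursion bottoms out at the empty context, where the back-off term is a base distribution (e.g. uniform over the vocabulary); each level thus smooths an unbounded context toward progressively shorter ones.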
