Bayesian Analysis in Natural Language Processing

Abstract: Natural language processing (NLP) underwent a profound transformation in the mid-1980s, when it shifted to heavy use of corpora and data-driven techniques for analyzing language. Since then, the use of statistical techniques in NLP has evolved in several ways. One such development took place in the late 1990s or early 2000s, when full-fledged Bayesian machinery was introduced to NLP. This Bayesian approach has come to address various shortcomings of the frequentist approach and to enrich it, especially in the unsupervised setting, where statistical learning is done without target prediction examples. We cover the methods and algorithms needed to read Bayesian learning papers in NLP fluently and to do research in the area. These methods and algorithms are partly borrowed from machine learning and statistics and partly developed "in-house" in NLP. We cover inference techniques such as Markov chain Monte Carlo sampling and variational inference, Ba...
