Bayesian learning of probabilistic language models

The general topic of this thesis is the probabilistic modeling of language, in particular natural language. In probabilistic language modeling, one characterizes the strings of phonemes, words, etc. of a certain domain in terms of a probability distribution over all possible strings within the domain. Probabilistic language modeling has been applied to a wide range of problems in recent years, from the traditional uses in speech recognition to more recent applications in biological sequence modeling.

The main contribution of this thesis is a particular approach to the learning problem for probabilistic language models, known as Bayesian model merging. This approach can be characterized as follows. (1) Models are built either in batch mode or incrementally from samples, by incorporating individual samples into a working model. (2) A small, uniform set of simple operators gradually transforms an instance-based model into a generalized model that abstracts from the data. (3) Instance-based parts of a model can coexist with generalized ones, depending on the degree of similarity among the observed samples, allowing the model to adapt to non-uniform coverage of the sample space. (4) The generalization process is driven and controlled by a single probabilistic metric: the Bayesian posterior probability of the model, which integrates both goodness-of-fit to the data and a notion of model simplicity ('Occam's Razor'). The Bayesian model merging framework is instantiated for three classes of probabilistic models: Hidden Markov Models (HMMs), stochastic context-free grammars (SCFGs), and simple probabilistic attribute grammars (PAGs).
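The merging dynamic can be illustrated with a minimal sketch, which is not any of the thesis algorithms themselves: each "state" below is just a table of outgoing-symbol counts, merging two states pools their counts, and a greedy loop accepts a merge whenever it improves the log posterior. The log prior penalizes model size by a per-state cost (the value of `STATE_PENALTY` is made up for this example), and the log likelihood is the multinomial fit of the pooled counts to the data.

```python
import math
from itertools import combinations

STATE_PENALTY = 3.0  # hypothetical log-prior cost per state (Occam factor)

def log_likelihood(states):
    """Multinomial log likelihood of each state's counts under ML parameters."""
    ll = 0.0
    for counts in states:
        total = sum(counts.values())
        for c in counts.values():
            ll += c * math.log(c / total)
    return ll

def log_posterior(states):
    # log posterior = log prior + log likelihood (up to a constant)
    return -STATE_PENALTY * len(states) + log_likelihood(states)

def merge(a, b):
    """Pool the symbol counts of two states."""
    out = dict(a)
    for sym, c in b.items():
        out[sym] = out.get(sym, 0) + c
    return out

def model_merging(states):
    """Greedily merge state pairs as long as the posterior improves."""
    states = list(states)
    improved = True
    while improved and len(states) > 1:
        improved = False
        best, best_model = log_posterior(states), None
        for i, j in combinations(range(len(states)), 2):
            candidate = (states[:i] + states[i+1:j] + states[j+1:]
                         + [merge(states[i], states[j])])
            score = log_posterior(candidate)
            if score > best:
                best, best_model, improved = score, candidate, True
        if best_model is not None:
            states = best_model
    return states

# Two states with near-identical behavior get merged (the likelihood loss is
# outweighed by the prior gain); a clearly distinct state survives.
initial = [{'a': 9, 'b': 1}, {'a': 8, 'b': 2}, {'c': 10}]
final = model_merging(initial)
```

The point of the sketch is the trade-off in `log_posterior`: every merge makes the likelihood no better, but shrinks the model, and merging stops exactly when the data no longer justify further generalization.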
Along with the theoretical background, various applications and case studies are presented, including the induction of multiple-pronunciation word models for speech recognition (with HMMs), data-driven learning of syntactic structures (with SCFGs), and the learning of simple sentence-meaning associations from examples (with PAGs).

Apart from language learning issues, a number of related computational problems involving probabilistic context-free grammars are discussed. A version of Earley's parser is presented that efficiently solves the standard computational problems associated with SCFGs, including the computation of sentence probabilities and sentence prefix probabilities, finding most likely parses, and the estimation of grammar parameters. Finally, we describe an algorithm that computes n-gram statistics from a given SCFG, based on solving linear systems derived from the grammar. This method can be an effective tool for transferring part of the probabilistic knowledge in a structured language model into an unstructured low-level form for use in applications such as speech decoding. We show that this problem is an instance of a larger class of related problems (such as computing average sentence length or derivation entropy), all of which are solvable with the same computational technique. An introductory chapter presents a unified view of the various model types and algorithms found in the literature, as well as of issues of model learning and estimation.
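The linear-system idea can be sketched for its simplest instance, expected terminal (unigram) counts, under a toy grammar invented for this example. If E[X] is the expected number of expansions of nonterminal X in a derivation from S, then E satisfies E = e_S + M^T E, where M[Y][X] is the expected number of X's produced by one expansion of Y; solving (I - M^T) E = e_S then yields expected terminal counts and average sentence length directly.

```python
def solve(a_mat, b_vec):
    """Gauss-Jordan elimination with partial pivoting (pure Python)."""
    n = len(b_vec)
    a = [row[:] + [b_vec[i]] for i, row in enumerate(a_mat)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col and a[r][col]:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

# Toy SCFG: (lhs, rhs, probability); lowercase symbols are terminals.
rules = [
    ('S', ['A', 'B'], 1.0),
    ('A', ['a', 'A'], 0.5),
    ('A', ['a'], 0.5),
    ('B', ['b'], 1.0),
]
nts = ['S', 'A', 'B']
idx = {x: i for i, x in enumerate(nts)}

# M[y][x]: expected number of nonterminal x per single expansion of y.
m = [[0.0] * len(nts) for _ in nts]
for lhs, rhs, p in rules:
    for sym in rhs:
        if sym in idx:
            m[idx[lhs]][idx[sym]] += p

# Solve (I - M^T) E = e_S for the expected expansion counts E.
a_mat = [[(1.0 if i == j else 0.0) - m[j][i] for j in range(len(nts))]
         for i in range(len(nts))]
e_s = [1.0 if x == 'S' else 0.0 for x in nts]
E = solve(a_mat, e_s)

# Expected terminal counts follow from E; their sum is the average length.
term_counts = {}
for lhs, rhs, p in rules:
    for sym in rhs:
        if sym not in idx:
            term_counts[sym] = term_counts.get(sym, 0.0) + E[idx[lhs]] * p
avg_len = sum(term_counts.values())
```

For this grammar E[A] = 2 (each A recurses with probability 1/2), so the expected sentence contains two a's and one b. Bigram and higher-order statistics require a richer system of equations than this unigram sketch, but the same solve-a-linear-system pattern applies.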

[1]  F. D. Saussure Cours de linguistique générale , 1924 .

[2]  P. Garvin,et al.  Prolegomena to a Theory of Language , 1953 .

[3]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[4]  A. Reber Implicit learning of artificial grammars , 1967 .

[5]  James Jay Horning,et al.  A study of grammatical inference , 1969 .

[6]  Thomas G. Evans,et al.  Grammatical Inference Techniques in Pattern Analysis , 1971 .

[7]  Alfred V. Aho,et al.  The Theory of Parsing, Translation, and Compiling , 1972 .

[8]  Harry Charles Lee,et al.  Stochastic linguistics for picture recognition , 1972 .

[9]  Taylor L. Booth,et al.  Applying Probability Measures to Abstract Languages , 1973, IEEE Transactions on Computers.

[10]  C. M. Cook,et al.  Grammatical inference by hill climbing , 1976, Inf. Sci..

[11]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[12]  J. Gerard Wolff,et al.  Grammar Discovery as Data Compression , 1978, AISB/GI.

[13]  Jeffrey D. Ullman,et al.  Introduction to Automata Theory, Languages and Computation , 1979 .

[14]  J. Baker Trainable grammars for speech recognition , 1979 .

[15]  Frederick Jelinek,et al.  Interpolated estimation of Markov source parameters from sparse data , 1980 .

[16]  Walter L. Ruzzo,et al.  An Improved Context-Free Recognizer , 1980, ACM Trans. Program. Lang. Syst..

[17]  Leona F. Fass,et al.  Learning context-free languages from their structured sentences , 1983, SIGA.

[18]  Jay Earley,et al.  An efficient context-free parsing algorithm , 1970, Commun. ACM.

[19]  J. Rissanen A UNIVERSAL PRIOR FOR INTEGERS AND ESTIMATION BY MINIMUM DESCRIPTION LENGTH , 1983 .

[20]  Lalit R. Bahl,et al.  A Maximum Likelihood Approach to Continuous Speech Recognition , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Carl H. Smith,et al.  Inductive Inference: Theory and Methods , 1983, CSUR.

[22]  Hermann Ney,et al.  The use of a one-stage dynamic programming algorithm for connected word recognition , 1984 .

[23]  R. Redner,et al.  Mixture densities, maximum likelihood, and the EM algorithm , 1984 .

[24]  Geoffrey K. Pullum,et al.  Generalized Phrase Structure Grammar , 1985 .

[25]  Frederick Jelinek,et al.  Markov Source Modeling of Text Generation , 1985 .

[26]  William H. Press,et al.  Numerical recipes in C. The art of scientific computing , 1987 .

[27]  Geoffrey E. Hinton,et al.  Learning and relearning in Boltzmann machines , 1986 .

[28]  L. Rabiner,et al.  An introduction to hidden Markov models , 1986, IEEE ASSP Magazine.

[29]  Stuart M. Shieber,et al.  An Introduction to Unification-Based Approaches to Grammar , 1986, CSLI Lecture Notes.

[30]  C. S. Wallace,et al.  Estimation and Inference by Compact Coding , 1987 .

[31]  Elissa L Newport,et al.  Structural packaging in the input to language learning: Contributions of prosodic and morphological marking of phrases to the acquisition of language , 1987, Cognitive Psychology.

[32]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[33]  William H. Press,et al.  Numerical Recipes in FORTRAN - The Art of Scientific Computing, 2nd Edition , 1987 .

[34]  Charles J. Fillmore,et al.  The Mechanisms of “Construction Grammar” , 1988 .

[35]  Annedore Paeseler Modification of Earley's algorithm for speech recognition , 1988 .

[36]  S. Gull Bayesian Inductive Inference and Maximum Entropy , 1988 .

[37]  James Kelly,et al.  AutoClass: A Bayesian Classification System , 1993, ML.

[38]  Yasubumi Sakakibara,et al.  Learning context-free grammars from structural data in polynomial time , 1988, COLT '88.

[39]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[40]  Ronald L. Rivest,et al.  Inferring Decision Trees Using the Minimum Description Length Principle , 1989, Inf. Comput..

[41]  John Cocke,et al.  Probabilistic Parsing Method for Sentence Disambiguation , 1989, IWPT.

[42]  Francine R. Chen Identification of contextual factors for pronunciation networks , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[43]  J. H. Wright,et al.  LR parsing of probabilistic grammars with input uncertainty for speech recognition , 1990 .

[44]  R. T. Cox Probability, frequency and reasonable expectation , 1990 .

[45]  R. Schwartz,et al.  The N-best algorithms: an efficient and exact procedure for finding the N most likely sentence hypotheses , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[46]  Steve Young,et al.  Applications of stochastic context-free grammars using the Inside-Outside algorithm , 1990 .

[47]  Ian H. Witten,et al.  Text Compression , 1990, 125 Problems in Text Algorithms.

[48]  Wray L. Buntine Theory Refinement on Bayesian Networks , 1991, UAI.

[49]  Stephen M. Omohundro,et al.  Best-First Model Merging for Dynamic Learning and Recognition , 1991, NIPS.

[50]  Jerome A. Feldman,et al.  Learning Automata from Ordered Examples , 1991, Mach. Learn..

[51]  James Glass,et al.  Integration of speech recognition and natural language processing in the MIT VOYAGER system , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[52]  Michael Riley,et al.  A statistical model for generating pronunciation networks , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[53]  Giorgio Satta,et al.  Computation of Probabilities for an Island-Driven Parser , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[54]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[55]  Axel Cleeremans Mechanisms of implicit learning: a parallel distributed processing model of sequence acquisition , 1991 .

[56]  John D. Lafferty,et al.  Computation of the Probability of Initial Substring Generation by Stochastic Context-Free Grammars , 1991, Comput. Linguistics.

[57]  Mitchell P. Marcus,et al.  Pearl: A Probabilistic Chart Parser , 1991, EACL.

[58]  Kenneth Ward Church,et al.  A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams , 1991 .

[59]  Chin-Hui Lee,et al.  Bayesian Learning of Gaussian Mixture Densities for Hidden Markov Models , 1991, HLT.

[60]  Philip Resnik,et al.  Probabilistic Tree-Adjoining Grammar as a Framework for Statistical Natural Language Processing , 1992, COLING.

[61]  Jason Eisner,et al.  A Probabilistic Parser Applied to Software Testing Documents , 1992, AAAI.

[62]  Julian M. Kupiec,et al.  Robust part-of-speech tagging using a hidden Markov model , 1992 .

[63]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[64]  Mark Alan Jones,et al.  A Probabilistic Parser and Its Application , 1992 .

[65]  Wray L. Buntine,et al.  Learning classification trees , 1992 .

[66]  Frederick Jelinek,et al.  Basic Methods of Probabilistic Context Free Grammars , 1992 .

[67]  F. Pereira,et al.  Inside-Outside Reestimation From Partially Bracketed Corpora , 1992, ACL.

[68]  Elie Bienenstock,et al.  Neural Networks and the Bias/Variance Dilemma , 1992, Neural Computation.

[69]  J. Kupiec Hidden Markov estimation for unrestricted stochastic context-free grammars , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[70]  Terrance Philip Regier,et al.  The acquisition of lexical semantics for spatial terms: a connectionist model of perceptual categorization , 1992 .

[71]  Pierre Baldi,et al.  Hidden Markov Models in Molecular Biology: New Algorithms and Applications , 1992, NIPS.

[72]  David M. Magerman,et al.  Efficiency, Robustness and Accuracy in Picky Chart Parsing , 1992, ACL.

[73]  Hervé Bourlard,et al.  Connectionist speech recognition , 1993 .

[74]  Dana Ron,et al.  The Power of Amnesia , 1993, NIPS.

[75]  E. Brill,et al.  Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach , 1993, HLT.

[76]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[77]  D. Haussler,et al.  Protein modeling using hidden Markov models: analysis of globins , 1993, [1993] Proceedings of the Twenty-sixth Hawaii International Conference on System Sciences.

[78]  Ted Briscoe,et al.  Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars , 1993, CL.

[79]  R. C. Underwood,et al.  THE APPLICATION OF STOCHASTIC CONTEXT-FREE GRAMMARS TO FOLDING, ALIGNING AND MODELING HOMOLOGOUS RNA SEQUENCES , 1993 .

[80]  Andreas Stolcke,et al.  Inducing Probabilistic Grammars by Bayesian Model Merging , 1994, ICGI.

[81]  Andreas Stolcke,et al.  Best-first Model Merging for Hidden Markov Model Induction , 1994, ArXiv.

[82]  Ido Dagan,et al.  Similarity-Based Estimation of Word Cooccurrence Probabilities , 1994, ACL.

[83]  Andreas Stolcke,et al.  The berkeley restaurant project , 1994, ICSLP.

[84]  Andreas Stolcke,et al.  Multiple-pronunciation lexical modeling in a speaker independent speech understanding system , 1994, ICSLP.

[85]  Andreas Stolcke,et al.  An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities , 1994, CL.