Inducing Probabilistic Grammars by Bayesian Model Merging

We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are incorporated by adding ad hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are merged to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a trade-off between a close fit to the data and a default preference for simpler models ('Occam's Razor'). The general scheme is illustrated using three types of probabilistic grammars: hidden Markov models, class-based n-grams, and stochastic context-free grammars.
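For concreteness, below is a minimal sketch (not the paper's implementation) of this scheme for the HMM case: each sample is first incorporated as a dedicated chain of states, and pairs of states are then greedily merged as long as the log posterior log P(G) + log P(D | G) improves. It assumes a simple description-length-style prior with an illustrative weight `alpha`, and it scores the likelihood by pooling counts along the samples' original paths (a Viterbi-style approximation); all function names are hypothetical.

```python
# A toy sketch of Bayesian model merging for HMMs (not the paper's code).
import math
from collections import defaultdict

def incorporate(samples):
    """Data incorporation: build one dedicated state chain per sample.
    States are integers; 0 is the start state, 1 the final state.
    Returns transition counts (state -> next -> n) and emission
    counts (state -> symbol -> n)."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    next_id = 2
    for sample in samples:
        prev = 0
        for sym in sample:
            trans[prev][next_id] += 1
            emit[next_id][sym] += 1
            prev = next_id
            next_id += 1
        trans[prev][1] += 1
    return trans, emit

def log_likelihood(trans, emit):
    """log P(D | G) under a Viterbi-style approximation: samples keep
    their original paths, and each count is scored against the
    maximum-likelihood multinomial of its (possibly merged) state."""
    ll = 0.0
    for counts in list(trans.values()) + list(emit.values()):
        total = sum(counts.values())
        ll += sum(c * math.log(c / total) for c in counts.values())
    return ll

def log_prior(trans, emit, alpha):
    """log P(G): a description-length-style structural prior (an
    assumption here, one of several priors possible) that charges
    `alpha` nats per state and per distinct transition/emission entry,
    so fewer, more shared states score higher."""
    size = len(emit) + sum(len(d) for d in trans.values()) \
                     + sum(len(d) for d in emit.values())
    return -alpha * size

def merged(trans, emit, keep, drop):
    """Return new count tables with state `drop` folded into `keep`."""
    rename = lambda s: keep if s == drop else s
    t2 = defaultdict(lambda: defaultdict(int))
    e2 = defaultdict(lambda: defaultdict(int))
    for s, row in trans.items():
        for t, c in row.items():
            t2[rename(s)][rename(t)] += c
    for s, row in emit.items():
        for sym, c in row.items():
            e2[rename(s)][sym] += c
    return t2, e2

def model_merge(samples, alpha=1.0):
    """Best-first merging: greedily apply the merge that most improves
    the log posterior; stop when no candidate merge improves it."""
    trans, emit = incorporate(samples)
    score = log_prior(trans, emit, alpha) + log_likelihood(trans, emit)
    while True:
        states = list(emit)  # emitting states only; start/end never merge
        best = None
        for i in range(len(states)):
            for j in range(i + 1, len(states)):
                t2, e2 = merged(trans, emit, states[i], states[j])
                s2 = log_prior(t2, e2, alpha) + log_likelihood(t2, e2)
                if best is None or s2 > best[0]:
                    best = (s2, t2, e2)
        if best is None or best[0] <= score:
            return trans, emit, score
        score, trans, emit = best

# Example: repeated structure in the samples lets chains be merged.
trans, emit, score = model_merge(["ab", "ab", "aab", "aaab"], alpha=0.5)
print(len(emit), "emitting states, log posterior", round(score, 2))
```

The same skeleton carries over to the other model classes in the abstract by changing what a merge operates on, e.g. merging word classes of an n-gram model or nonterminals of a stochastic context-free grammar instead of HMM states.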
