Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction

We introduce an entropic prior for multinomial parameter estimation problems and solve for its maximum a posteriori (MAP) estimator. The prior is a bias for maximally structured and minimally ambiguous models. In conditional probability models with hidden state, iterative MAP estimation drives weakly supported parameters toward extinction, effectively turning them off. Thus, structure discovery is folded into parameter estimation. We then establish criteria for simplifying a probabilistic model's graphical structure by trimming parameters and states, with a guarantee that any such deletion will increase the posterior probability of the model. Trimming accelerates learning by sparsifying the model. All operations monotonically and maximally increase the posterior probability, yielding structure-learning algorithms only slightly slower than parameter estimation via expectation-maximization and orders of magnitude faster than search-based structure induction. When applied to hidden Markov model training, the resulting models show superior generalization to held-out test data. In many cases the resulting models are so sparse and concise that they are interpretable, with hidden states that strongly correlate with meaningful categories.
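As a sketch of what the MAP estimator mentioned above could look like, assume the entropic prior has the form $P_e(\theta) \propto e^{-H(\theta)}$, where $H$ is the Shannon entropy of the multinomial parameters. The abstract does not state this form, so it is an assumption here, as are the symbols $\omega_i$ (evidence counts) and $\lambda$ (a Lagrange multiplier for the sum-to-one constraint). Under that assumption, the stationarity condition can be solved with the Lambert $W$ function, defined by $W(x)\,e^{W(x)} = x$:

```latex
% Assumed prior: P_e(\theta) \propto e^{-H(\theta)} = \prod_i \theta_i^{\theta_i}
% Multinomial likelihood with evidence \omega_i: \prod_i \theta_i^{\omega_i}
\begin{align}
  \log P(\theta \mid \omega)
    &= \sum_i (\omega_i + \theta_i)\,\log\theta_i + \text{const}, \\
  0 &= \frac{\partial}{\partial\theta_i}
       \Big[ \sum_j (\omega_j + \theta_j)\log\theta_j
             + \lambda \Big( \sum_j \theta_j - 1 \Big) \Big]
     = \frac{\omega_i}{\theta_i} + \log\theta_i + 1 + \lambda, \\
  \theta_i &= \frac{-\omega_i}{W\!\left(-\omega_i\, e^{1+\lambda}\right)}.
\end{align}
```

Substituting $\theta_i = -\omega_i / W$ into the second line gives $W e^{W} = -\omega_i e^{1+\lambda}$, which is the defining relation for $W$. In practice $\lambda$ would still have to be chosen so that the $\theta_i$ sum to one, for example by iterating between the last two lines.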
