Truncated Variational Expectation Maximization

We derive a novel variational expectation maximization (EM) approach based on truncated variational distributions: distributions that are proportional to the exact posterior within a subset of a discrete state space and are exactly zero otherwise. The approach is realized by first generalizing the standard variational EM framework to include variational distributions with exact (‘hard’) zeros. A fully variational treatment of truncated distributions then allows us to derive mathematically grounded results, which in turn can be used to formulate efficient algorithms for optimizing the parameters of probabilistic generative models. We find that the free energies corresponding to truncated distributions are given by concise and efficiently computable expressions, while the update equations for the model parameters (M-steps) retain their standard form. Furthermore, we obtain generic expressions for expectation values w.r.t. truncated distributions. Based on these observations, we show how efficient and easily applicable meta-algorithms can be formulated that guarantee a monotonic increase of the free energy. Example applications of the framework derived here yield new theoretical results and learning procedures for latent variable models and mixture models, including procedures that tightly couple sampling and variational optimization. Finally, by considering a special case of truncated variational distributions, we cleanly and fully embed the well-known ‘hard EM’ approaches into the variational EM framework, and we show that ‘hard EM’ (for models with discrete latents) provably optimizes a free-energy lower bound on the data log-likelihood.
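To make these statements concrete, the following display sketches the central quantities in notation introduced here for illustration (it does not appear in the abstract): $y^{(n)}$ is the $n$-th data point, $s$ a discrete latent state, $\Theta$ the model parameters, and $\mathcal{K}_n$ a subset of the state space. Under these conventions, a truncated variational distribution, the free energy it induces, and expectations w.r.t. it take the concise forms referred to above:

$$
q_n(s) \;=\; \frac{p(s \mid y^{(n)}, \Theta)}{\sum_{s' \in \mathcal{K}_n} p(s' \mid y^{(n)}, \Theta)}\;\delta(s \in \mathcal{K}_n),
\qquad
\mathcal{F}(\mathcal{K}, \Theta) \;=\; \sum_{n} \log \sum_{s \in \mathcal{K}_n} p(s, y^{(n)} \mid \Theta),
$$
$$
\langle g(s) \rangle_{q_n} \;=\; \frac{\sum_{s \in \mathcal{K}_n} p(s, y^{(n)} \mid \Theta)\, g(s)}{\sum_{s' \in \mathcal{K}_n} p(s', y^{(n)} \mid \Theta)}.
$$

Because $\mathcal{F}$ sums the logs of partial joints over the sets $\mathcal{K}_n$, replacing a state in $\mathcal{K}_n$ by one with a higher joint probability can only increase the free energy; this is the handle a monotonic meta-algorithm can exploit. The Python sketch below illustrates one such loop under these assumptions; the helpers log_joint, propose_states, and m_step are hypothetical placeholders, not an implementation from the paper:

```python
import numpy as np

def tv_em_iteration(data, K, theta, log_joint, propose_states, m_step):
    """One truncated variational EM iteration (sketch).

    data: list of data points y^(n)
    K: list of sets of (hashable) latent states, one set per data point
    log_joint(s, y, theta): log p(s, y | theta)
    propose_states(y, theta): iterable of candidate states for y
    m_step(data, posteriors): standard M-step fed truncated expectations
    """
    # Variational E-step: for each data point, merge candidate states
    # into K_n and keep the |K_n| states with the largest joint.
    # Since F(K, theta) = sum_n log sum_{s in K_n} p(s, y_n | theta),
    # this substitution can only increase the free energy (monotonicity).
    for n, y in enumerate(data):
        size = len(K[n])
        candidates = set(K[n]) | set(propose_states(y, theta))
        K[n] = set(sorted(candidates,
                          key=lambda s: log_joint(s, y, theta),
                          reverse=True)[:size])

    # Truncated posteriors: joints p(s, y_n | theta) normalized over
    # K_n only. These weights enter the otherwise standard M-step.
    posteriors = []
    for n, y in enumerate(data):
        states = list(K[n])
        lj = np.array([log_joint(s, y, theta) for s in states])
        w = np.exp(lj - lj.max())  # log-sum-exp stabilization
        posteriors.append((states, w / w.sum()))

    theta = m_step(data, posteriors)  # update equations in standard form
    return theta, K
```

As a usage note, propose_states can itself be a sampler, which is one way sampling and variational optimization can be tightly coupled; restricting each $\mathcal{K}_n$ to a single state recovers a ‘hard EM’-style update, matching the special case discussed above.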
