Bidirectional Helmholtz Machines

Efficient unsupervised training and inference in deep generative models remains a challenging problem. One basic approach, the Helmholtz machine, trains a top-down directed generative model together with a bottom-up auxiliary model used for approximate inference. Recent results indicate that better generative models can be obtained with better approximate inference procedures. Instead of improving the inference procedure, we propose here a new model that guarantees the top-down and bottom-up distributions can efficiently invert each other. We achieve this by interpreting both the top-down and the bottom-up directed models as approximate inference distributions and by defining the model distribution to be the geometric mean of the two. We present a lower bound on the likelihood of this model and show that optimizing this bound regularizes the model so that the Bhattacharyya distance between the bottom-up and top-down approximate distributions is minimized. This approach yields state-of-the-art generative models that prefer significantly deeper architectures while allowing for orders-of-magnitude more efficient approximate inference.
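
As a sketch of the construction the abstract describes, using notation we introduce here (a top-down model $p$ and a bottom-up model $q$, both over visibles $\mathbf{x}$ and latents $\mathbf{h}$; the symbols and the exact form below are our reading, not quoted from the paper):

```latex
% Geometric-mean model (sketch; notation ours):
%   p = top-down generative model, q = bottom-up auxiliary model.
\[
  p^{*}(\mathbf{x},\mathbf{h})
    = \frac{1}{Z}\sqrt{p(\mathbf{x},\mathbf{h})\,q(\mathbf{x},\mathbf{h})},
  \qquad
  Z = \sum_{\mathbf{x},\mathbf{h}}\sqrt{p(\mathbf{x},\mathbf{h})\,q(\mathbf{x},\mathbf{h})}
    \le 1
  \quad \text{(Cauchy--Schwarz)}.
\]
% Tractable lower bound on the marginal log-likelihood:
\[
  \log p^{*}(\mathbf{x})
    \;\ge\;
  \log \tilde{p}^{*}(\mathbf{x})
    := 2\log \sum_{\mathbf{h}} \sqrt{p(\mathbf{x},\mathbf{h})\,q(\mathbf{h}\mid\mathbf{x})}.
\]
% The bound decomposes into a likelihood term minus a Bhattacharyya-distance
% regularizer between the top-down and bottom-up (approximate) posteriors:
\[
  \log \tilde{p}^{*}(\mathbf{x})
    = \log p(\mathbf{x})
      - 2\,D_{\mathrm{B}}\!\left(p(\mathbf{h}\mid\mathbf{x}),\,q(\mathbf{h}\mid\mathbf{x})\right),
  \qquad
  D_{\mathrm{B}}(p,q) = -\log \sum_{\mathbf{h}} \sqrt{p(\mathbf{h})\,q(\mathbf{h})}.
\]
```

The decomposition makes the abstract's regularization claim explicit: maximizing the bound pushes up the top-down likelihood $\log p(\mathbf{x})$ while pulling the two posteriors together in Bhattacharyya distance. Note also that the bound can be estimated by simple importance sampling, drawing $\mathbf{h}^{(k)} \sim q(\mathbf{h}\mid\mathbf{x})$ and averaging $\sqrt{p(\mathbf{x},\mathbf{h}^{(k)})/q(\mathbf{h}^{(k)}\mid\mathbf{x})}$, which is consistent with the claimed efficiency of approximate inference in this model.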
