Posterior Approximation using Stochastic Gradient Ascent with Adaptive Stepsize

Abstract Scalable posterior approximation algorithms allow Bayesian nonparametric models such as the Dirichlet process mixture to scale to larger datasets at a fraction of the cost. Recent algorithms, notably stochastic variational inference, perform local learning from minibatches. The main limitation of stochastic variational inference is that it relies on closed-form solutions. Stochastic gradient ascent is a modern approach to machine learning and is widely deployed in the training of deep neural networks. In this work, we explore stochastic gradient ascent as a fast algorithm for posterior approximation of the Dirichlet process mixture. However, stochastic gradient ascent alone is not optimal for learning. To achieve both speed and performance, we focus on stepsize optimization in stochastic gradient ascent. As an intermediate approach, we first optimize the stepsize using the momentum method. Finally, we introduce Fisher information to allow an adaptive stepsize in our posterior approximation. In the experiments, we show that our stochastic gradient ascent approach does not sacrifice performance for speed when compared to closed-form coordinate ascent learning on the same datasets. Lastly, our approach is compatible with deep ConvNet features and scales to datasets with many classes, such as Caltech256 and SUN397.
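
To make the flavor of such an update concrete, the following is a minimal sketch of a stochastic gradient ascent step that combines momentum with a Fisher-information-style adaptive stepsize. It is an illustrative assumption, not the paper's actual update rule for the Dirichlet process mixture: the function name sga_step, the diagonal Fisher estimate built from squared minibatch gradients, and all hyperparameter values are hypothetical, and the toy objective stands in for the variational objective.

import numpy as np

def sga_step(lam, grad, state, lr=0.05, beta=0.9, eps=1e-8):
    # One stochastic gradient ascent step on a variational parameter vector `lam`.
    # Momentum: exponential moving average of the noisy minibatch gradient.
    state["m"] = beta * state["m"] + (1.0 - beta) * grad
    # Diagonal empirical Fisher estimate: running average of squared gradients.
    state["F"] = beta * state["F"] + (1.0 - beta) * grad ** 2
    # Adaptive stepsize: precondition the momentum by the inverse Fisher diagonal.
    return lam + lr * state["m"] / (state["F"] + eps)

# Toy usage: ascend the concave objective f(lam) = -sum((lam - 3)^2) from noisy gradients.
rng = np.random.default_rng(0)
lam = np.zeros(5)
state = {"m": np.zeros(5), "F": np.zeros(5)}
for _ in range(2000):
    grad = -2.0 * (lam - 3.0) + rng.normal(scale=0.5, size=5)  # noisy "minibatch" gradient
    lam = sga_step(lam, grad, state)
print(np.round(lam, 2))  # should land near the maximizer [3, 3, 3, 3, 3]

The design choice shown here is that the per-coordinate stepsize shrinks where the estimated Fisher information (gradient variability) is large and grows where it is small, while the momentum term smooths minibatch noise; this mirrors the abstract's progression from momentum to a Fisher-information-based adaptive stepsize.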
