Rao-Blackwellized Stochastic Gradients for Discrete Distributions

We wish to compute the gradient of an expectation over a finite or countably infinite sample space having $K \leq \infty$ categories. When $K$ is indeed infinite, or finite but very large, the relevant summation is intractable. Accordingly, various stochastic gradient estimators have been proposed. In this paper, we describe a technique that can be applied to reduce the variance of any such estimator, without changing its bias---in particular, unbiasedness is retained. We show that our technique is an instance of Rao-Blackwellization, and we demonstrate the improvement it yields on a semi-supervised classification problem and a pixel attention task.
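To make the idea concrete, here is a minimal sketch (an illustration, not the paper's exact algorithm) of how Rao-Blackwellization can reduce the variance of the score-function (REINFORCE) estimator of $\nabla_\theta \, \mathbb{E}_{z \sim q_\theta}[f(z)]$ for a categorical $q_\theta$: the $m$ highest-probability categories are summed analytically, and a single draw from the renormalized remainder covers the tail. The choice of the summed set, the function names, and the single-sample remainder term are assumptions made for illustration; the estimator stays unbiased because the exact partial sum plus the reweighted sample has expectation $\sum_k q_\theta(k)\, f(k)\, \nabla_\theta \log q_\theta(k)$.

```python
import numpy as np

def softmax(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

def score(k, probs):
    # Gradient of log softmax(logits)[k] with respect to the logits: e_k - probs.
    g = -probs.copy()
    g[k] += 1.0
    return g

def reinforce_grad(logits, f, rng):
    # Plain single-sample score-function estimate of grad E_{z~q}[f(z)].
    probs = softmax(logits)
    z = rng.choice(len(probs), p=probs)
    return f(z) * score(z, probs)

def rao_blackwellized_grad(logits, f, m, rng):
    # Sum the m highest-probability categories exactly; draw one sample
    # from the renormalized remainder for the low-probability tail.
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]
    top, rest = order[:m], order[m:]
    est = sum(probs[k] * f(k) * score(k, probs) for k in top)
    q_rest = probs[rest].sum()
    if q_rest > 0.0:
        z = rng.choice(rest, p=probs[rest] / q_rest)  # z ~ q( . | z not in top)
        est = est + q_rest * f(z) * score(z, probs)
    return est
```

A quick usage example under the same assumptions:

```python
rng = np.random.default_rng(0)
logits = rng.normal(size=10)
f = lambda z: float(z)          # any scalar function of the category index
g_plain = reinforce_grad(logits, f, rng)
g_rb = rao_blackwellized_grad(logits, f, m=5, rng=rng)
```

Averaged over many draws, both estimators target the same gradient, but the Rao-Blackwellized version replaces sampling noise over the high-probability categories with their exact contribution, which is where the variance reduction comes from.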
