Estimating Gradients for Discrete Random Variables by Sampling without Replacement

We derive an unbiased estimator for expectations over discrete random variables based on sampling without replacement, which reduces variance because it avoids duplicate samples. We show that our estimator can be derived as the Rao-Blackwellization of three different estimators. Combining our estimator with REINFORCE, we obtain a policy gradient estimator, and we reduce its variance using a built-in control variate that requires no additional model evaluations. The resulting estimator is closely related to other gradient estimators. In experiments with a toy problem, a categorical Variational Auto-Encoder, and a structured prediction problem, our estimator is the only one that is consistently among the best in both high- and low-entropy settings.
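
To make the idea of an unbiased estimate from a sample without replacement concrete, here is a minimal sketch. It is an illustration under stated assumptions, not necessarily the exact formulation of the paper: it weights each sampled element s by p(s) times a "leave-one-out" ratio, namely the probability of drawing the remaining elements from the domain with s removed, divided by the probability of drawing the whole set. The function names (`ordered_prob`, `set_prob`, `estimate`, `sample_without_replacement`) are hypothetical, and the set probabilities are computed by brute-force enumeration of permutations, which is only practical for small sample sizes.

```python
import itertools
import random


def ordered_prob(seq, p):
    """Probability of drawing the indices in `seq`, in this order, without replacement."""
    prob, remaining = 1.0, 1.0
    for i in seq:
        prob *= p[i] / remaining
        remaining -= p[i]
    return prob


def set_prob(subset, p, exclude=None):
    """Probability that the unordered sample equals `subset`.

    If `exclude` is given, that index is removed from the domain and the
    remaining probabilities are renormalized first.
    """
    if exclude is not None:
        scale = 1.0 - p[exclude]
        p = [0.0 if i == exclude else q / scale for i, q in enumerate(p)]
    # Sum over every order in which the same set could have been drawn.
    return sum(ordered_prob(perm, p) for perm in itertools.permutations(subset))


def estimate(subset, p, f):
    """Estimate E[f(x)] from one unordered sample without replacement.

    Assumption for this sketch: each element s is weighted by p(s) times a
    leave-one-out ratio of set probabilities.
    """
    p_set = set_prob(subset, p)
    total = 0.0
    for s in subset:
        rest = tuple(i for i in subset if i != s)
        ratio = set_prob(rest, p, exclude=s) / p_set
        total += p[s] * ratio * f[s]
    return total


def sample_without_replacement(p, k, rng):
    """Draw k distinct indices sequentially, renormalizing after each draw."""
    remaining = list(range(len(p)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(remaining, weights=[p[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return tuple(sorted(chosen))


if __name__ == "__main__":
    p = [0.5, 0.3, 0.15, 0.05]   # toy categorical distribution (made up)
    f = [1.0, -2.0, 4.0, 10.0]   # toy function values (made up)
    rng = random.Random(0)
    sample = sample_without_replacement(p, 2, rng)
    print("sample:", sample, "estimate:", estimate(sample, p, f))
    print("exact expectation:", sum(q * v for q, v in zip(p, f)))
```

Averaging `estimate` over all possible size-k sets, each weighted by `set_prob`, recovers the exact expectation, which is what unbiasedness means here. With k = 1 the sketch reduces to the ordinary single-sample Monte Carlo estimate f(s).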
