Direct Optimization through arg max for Discrete Variational Auto-Encoder

Reparameterization of variational auto-encoders with continuous random variables is an effective method for reducing the variance of their gradient estimates. In the discrete case, one can perform reparameterization using the Gumbel-Max trick, but the resulting objective relies on an $\arg \max$ operation and is therefore non-differentiable. In contrast to previous works, which resort to softmax-based relaxations, we propose to optimize this objective directly by applying the direct loss minimization approach. Our proposal extends naturally to structured discrete latent variable models whenever the $\arg \max$ operation remains tractable to evaluate. We demonstrate empirically the effectiveness of the direct loss minimization technique on variational auto-encoders with both unstructured and structured discrete latent variables.
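To make the abstract's proposal concrete, here is a minimal NumPy sketch for the unstructured case: a categorical latent variable over $K$ categories is sampled with the Gumbel-Max trick, $z(\theta, \gamma) = \arg\max_k (\theta_k + \gamma_k)$ with $\gamma_k \sim \text{Gumbel}(0, 1)$, and the gradient of $\mathbb{E}_\gamma[f(z)]$ with respect to the logits $\theta$ is estimated, in the spirit of direct loss minimization, as the scaled difference between a loss-perturbed $\arg\max$ and the unperturbed one. The function names, the toy objective values, and the step size below are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sample_gumbel(shape, rng):
    """Standard Gumbel(0, 1) noise via the inverse-CDF transform."""
    u = rng.uniform(low=1e-10, high=1.0, size=shape)
    return -np.log(-np.log(u))

def direct_grad_estimate(theta, f_values, eps, rng, num_samples=1000):
    """Monte Carlo estimate of d/dtheta E_gamma[f(z(theta, gamma))].

    theta:     (K,) logits of the categorical variational posterior.
    f_values:  (K,) the downstream objective f evaluated at each one-hot z = e_k
               (tractable here because the latent space is small and unstructured).
    eps:       finite-difference step of the direct estimator.
    """
    K = theta.shape[0]
    grad = np.zeros(K)
    for _ in range(num_samples):
        gamma = sample_gumbel(K, rng)
        # Gumbel-Max sample: z = argmax_k (theta_k + gamma_k)
        k = np.argmax(theta + gamma)
        # Loss-perturbed argmax: each score is shifted by eps * f(e_k)
        k_eps = np.argmax(theta + gamma + eps * f_values)
        # Gradient contribution: (one_hot(k_eps) - one_hot(k)) / eps
        grad[k_eps] += 1.0 / eps
        grad[k] -= 1.0 / eps
    return grad / num_samples

# Toy usage (all values hypothetical).
rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 1.0])
f_values = np.array([1.0, 0.0, -1.0])
print(direct_grad_estimate(theta, f_values, eps=0.1, rng=rng))
```

In this sketch $\epsilon$ acts as a bias-variance knob: as $\epsilon \to 0$ the finite difference approaches the true gradient, but the variance of the Monte Carlo estimate grows, so in practice $\epsilon$ is treated as a tunable hyperparameter.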
