Wasserstein Dependency Measure for Representation Learning

Mutual information maximization has emerged as a powerful learning objective for unsupervised representation learning, obtaining state-of-the-art performance in applications such as object recognition, speech recognition, and reinforcement learning. However, such approaches are fundamentally limited because a tight lower bound on mutual information requires a sample size exponential in the mutual information itself. This limits the applicability of these approaches to prediction tasks with high mutual information, such as video understanding or reinforcement learning. In these settings, such techniques are prone to overfitting, both in theory and in practice, and capture only a few of the relevant factors of variation, yielding incomplete representations that are suboptimal for downstream tasks. In this work, we empirically demonstrate that mutual information-based representation learning approaches do indeed fail to learn complete representations on a number of designed and real-world tasks. To mitigate these problems, we introduce the Wasserstein dependency measure, which learns more complete representations by using the Wasserstein distance in place of the KL divergence in the mutual information estimator. We show that a practical approximation to this theoretically motivated solution, constructed using Lipschitz constraint techniques from the GAN literature, achieves substantially improved results on tasks where incomplete representations are a major challenge.
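Concretely, where mutual information is the KL divergence between the joint and the product of marginals, I(X; Y) = D_KL(p(x, y) || p(x)p(y)), the Wasserstein dependency measure replaces the KL with the Wasserstein-1 distance, I_W(X; Y) = W_1(p(x, y), p(x)p(y)). By Kantorovich-Rubinstein duality, W_1(p(x, y), p(x)p(y)) = sup_{f in Lip_1} E_{p(x,y)}[f(x, y)] - E_{p(x)p(y)}[f(x, y)], which motivates enforcing a Lipschitz constraint on the critic of a standard contrastive estimator. The sketch below is one way such a construction might look; it is our illustrative approximation, not the authors' released code, and the names Critic and wpc_loss, the architecture, and the choice of PyTorch's spectral_norm (one Lipschitz-constraint technique from the GAN literature) are our assumptions.

# Minimal sketch (assumptions noted above): a CPC-style contrastive
# objective whose critic is kept approximately 1-Lipschitz via
# spectral normalization, in the spirit of the paper's practical method.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Critic(nn.Module):
    """Scores (x, y) pairs; spectral_norm bounds each linear layer's
    spectral norm by 1, so with 1-Lipschitz ReLU the whole critic is
    approximately 1-Lipschitz."""
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(x_dim + y_dim, hidden)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def wpc_loss(critic, x, y):
    """Contrastive loss over a batch: positives are the aligned pairs
    (x_i, y_i); negatives are all mismatched pairs (x_i, y_j)."""
    batch = x.shape[0]
    # scores[i, j] = critic(x_i, y_j); the diagonal holds the positives.
    scores = critic(
        x.unsqueeze(1).expand(-1, batch, -1).reshape(batch * batch, -1),
        y.unsqueeze(0).expand(batch, -1, -1).reshape(batch * batch, -1),
    ).view(batch, batch)
    labels = torch.arange(batch)  # row i's positive is column i
    return nn.functional.cross_entropy(scores, labels)

Because each spectrally normalized layer has spectral norm at most 1 and ReLU is 1-Lipschitz, the composed critic cannot produce the unbounded scores an unconstrained density-ratio critic can, which is the mechanism the paper exploits to avoid the exponential sample-size requirement of tight KL-based bounds.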
