Markov Information Bottleneck to Improve Information Flow in Stochastic Neural Networks

While rate-distortion theory compresses data under a distortion constraint, the information bottleneck (IB) generalizes rate-distortion theory to learning problems by replacing the distortion constraint with a constraint on relevant information. In this work, we further extend IB to multiple Markov bottlenecks (i.e., latent variables that form a Markov chain), yielding the Markov information bottleneck (MIB), which fits the setting of stochastic neural networks (SNNs) better than the original IB. We show that the Markov bottlenecks cannot simultaneously achieve their information optimality in a non-collapsed MIB, and we therefore devise an optimality compromise. With MIB, we take the novel perspective that each layer of an SNN is a bottleneck whose learning goal is to encode relevant information from the data in a compressed form. The inference from a hidden layer to the output layer is then interpreted as a variational approximation to that layer's decoding of relevant information in the MIB. As a consequence of this perspective, the maximum likelihood estimation (MLE) principle for SNNs becomes a special case of the variational MIB. We show that, compared to MLE, the variational MIB encourages better information flow in SNNs both in principle and in practice, and empirically improves performance in classification, adversarial robustness, and multi-modal learning on MNIST.
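The layer-wise view above can be made concrete through the standard variational bound on the IB objective, which trades a relevance term I(Z; Y) against a compression term I(X; Z) weighted by a coefficient beta. Below is a minimal PyTorch sketch of such a per-layer loss for a single stochastic Gaussian layer, written under common variational-IB assumptions (Gaussian encoder, standard-normal prior, cross-entropy decoder bound). It illustrates the general idea rather than the paper's exact MIB algorithm, and names such as BottleneckLayer and vib_loss are hypothetical. Note that beta = 0 reduces the loss to the usual MLE cross-entropy, consistent with the claim that MLE is a special case.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckLayer(nn.Module):
    """One stochastic layer: encodes its input into a Gaussian code z."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.log_var = nn.Linear(in_dim, z_dim)

    def forward(self, x):
        mu, log_var = self.mu(x), self.log_var(x)
        # Reparameterized sample from q(z | x) = N(mu, diag(exp(log_var)))
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        # KL(q(z|x) || N(0, I)): a variational upper bound on the compression term I(X; Z)
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=1)
        return z, kl

def vib_loss(logits, labels, kl, beta=1e-3):
    # Cross-entropy is a variational lower bound on the relevance term I(Z; Y);
    # beta trades compression against relevance (beta = 0 recovers plain MLE).
    return F.cross_entropy(logits, labels) + beta * kl.mean()

# Usage sketch on dummy MNIST-sized data
layer = BottleneckLayer(in_dim=784, z_dim=64)
decoder = nn.Linear(64, 10)  # variational decoder of the relevant information
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
z, kl = layer(x)
loss = vib_loss(decoder(z), y, kl)
loss.backward()
```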
