Neural Optimal Control for Representation Learning

The intriguing connections recently established between neural networks and dynamical systems have invited deep learning researchers to tap into the well-explored principles of differential calculus. Notably, the adjoint sensitivity method used in neural ordinary differential equations (Neural ODEs) has cast the training of neural networks as a control problem in which neural modules operate as continuous-time homeomorphic transformations of features. Typically, these methods optimize a single set of parameters governing the dynamical system for the whole dataset, forcing the network to learn complex transformations that are functionally limited and computationally heavy. Instead, we propose learning a data-conditioned distribution of \emph{optimal controls} over the network dynamics, emulating a form of input-dependent fast neural plasticity. We describe a general method for training such models, provide convergence proofs under mild hypotheses on the ODEs, and show empirically that this approach yields simpler dynamics and reduces the computational cost of Neural ODEs. We evaluate the approach on unsupervised image representation learning; our new "functional" auto-encoding model with ODEs, AutoencODE, achieves state-of-the-art image reconstruction quality on CIFAR-10 and exhibits substantial improvements in unsupervised classification over existing auto-encoding models.

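To make the idea of data-conditioned controls concrete, the sketch below shows one way an input-dependent Neural ODE block could be wired up in PyTorch. It is not the authors' implementation: the `ControlledODEBlock` and `control_net` names, the layer sizes, the one-hidden-layer vector field, and the fixed-step Euler integrator are all illustrative assumptions. The point it illustrates is the one made in the abstract: a small control network predicts the parameters of the vector field for each example, so the dynamics vary with the input instead of a single parameter set being shared across the whole dataset.

```python
# Minimal sketch (assumed architecture, not the paper's code) of a
# data-conditioned Neural ODE block: a control network maps each input to the
# weights of the vector field, making the ODE dynamics input-dependent.
import torch
import torch.nn as nn


class ControlledODEBlock(nn.Module):
    def __init__(self, dim: int, hidden: int = 64, steps: int = 10):
        super().__init__()
        self.dim, self.hidden, self.steps = dim, hidden, steps
        # Hypothetical control network: predicts per-example weights of a
        # one-hidden-layer vector field f(h; u(x)) = W2 * tanh(W1 * h).
        n_params = hidden * dim + dim * hidden
        self.control_net = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_params)
        )

    def vector_field(self, h, controls):
        # Unpack the per-example "controls" into the two weight matrices.
        b = h.shape[0]
        w1 = controls[:, : self.hidden * self.dim].view(b, self.hidden, self.dim)
        w2 = controls[:, self.hidden * self.dim :].view(b, self.dim, self.hidden)
        z = torch.tanh(torch.bmm(w1, h.unsqueeze(-1)))
        return torch.bmm(w2, z).squeeze(-1)

    def forward(self, x):
        controls = self.control_net(x)      # input-dependent controls u(x)
        h, dt = x, 1.0 / self.steps
        for _ in range(self.steps):         # fixed-step Euler integration of dh/dt = f(h; u(x))
            h = h + dt * self.vector_field(h, controls)
        return h


# Usage: transform a batch of 32-dimensional features with input-conditioned dynamics.
block = ControlledODEBlock(dim=32)
out = block(torch.randn(8, 32))
print(out.shape)  # torch.Size([8, 32])
```

In an auto-encoding setup like the one described above, such a block could stand in for a shared-parameter ODE module, with the control network trained jointly with the rest of the model; an adaptive solver (e.g. torchdiffeq's `odeint`) could replace the fixed Euler loop without changing the conditioning scheme.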