Relative gradient optimization of the Jacobian term in unsupervised deep learning

Learning expressive probabilistic models that correctly describe the data is a ubiquitous problem in machine learning. A popular approach to solving it is mapping the observations into a representation space with a simple joint distribution, which can typically be written as a product of its marginals -- thus drawing a connection with the field of nonlinear independent component analysis. Deep density models have been widely used for this task, but their likelihood-based training requires estimating the log-determinant of the Jacobian and is computationally expensive, thus imposing a trade-off between computation and expressive power. In this work, we propose a new approach for exact likelihood-based training of such neural networks. Based on relative gradients, we exploit the matrix structure of neural network parameters to compute updates efficiently even in high-dimensional spaces; the computational cost of training is quadratic in the input size, in contrast with the cubic scaling of naive approaches. This allows fast training with objective functions involving the log-determinant of the Jacobian, without imposing constraints on its structure, in stark contrast to normalizing flows. An implementation of our method can be found at this https URL.
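
The computational trick described above can be illustrated with a minimal, self-contained sketch (this is not the authors' released implementation). Assume a single invertible fully connected layer z = sigma(W x) with a standard normal base distribution and a smooth leaky-ReLU nonlinearity; the layer setup, the nonlinearity, and all names below are illustrative choices. The Euclidean gradient of the log-likelihood contains the term W^{-T} coming from log|det W|, which requires a cubic-cost matrix inversion; right-multiplying the gradient by W^T W (the relative gradient) cancels that inverse, leaving only matrix-vector products.

```python
# Illustrative sketch of the relative-gradient update for one invertible
# fully connected layer z = sigma(W x), standard normal prior on z.
# Function names and the choice of nonlinearity are assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
D = 5                                   # input dimension
W = rng.normal(size=(D, D)) / np.sqrt(D)
x = rng.normal(size=D)
alpha = 0.1                             # slope parameter of the smooth leaky-ReLU

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Smooth leaky-ReLU: strictly increasing (hence invertible), with simple derivatives.
def sigma(a):   return alpha * a + (1 - alpha) * np.logaddexp(0.0, a)   # softplus part
def dsigma(a):  return alpha + (1 - alpha) * sigmoid(a)
def d2sigma(a): return (1 - alpha) * sigmoid(a) * (1 - sigmoid(a))

a = W @ x                               # pre-activation
z = sigma(a)

# Euclidean gradient of log p(x) = log N(z; 0, I) + sum_i log sigma'(a_i) + log|det W|
u = (-z) * dsigma(a)                    # prior term, via chain rule through sigma
v = d2sigma(a) / dsigma(a)              # log-det of the elementwise nonlinearity
naive_grad = np.outer(u + v, x) + np.linalg.inv(W).T    # O(D^3): explicit inverse

# Relative gradient: right-multiply by W^T W. The inverse cancels,
# W^{-T} W^T W = W, so only O(D^2) matrix-vector products remain.
row = (W @ x) @ W                       # x^T W^T W as two matrix-vector products
relative_grad = np.outer(u + v, row) + W

# Sanity check: the two expressions agree up to the change of metric.
assert np.allclose(naive_grad @ W.T @ W, relative_grad)
```

In a deeper network the same cancellation can be applied layer by layer, which is how the quadratic (rather than cubic) per-update cost mentioned in the abstract arises; the sketch above only checks the single-layer identity.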
