Alternating Differentiation for Optimization Layers

The idea of embedding optimization problems into deep neural networks as optimization layers to encode constraints and inductive priors has taken hold in recent years. Most existing methods implicitly differentiate the Karush–Kuhn–Tucker (KKT) conditions in a way that requires expensive computations on the Jacobian matrix, which can be slow and memory-intensive. In this paper, we develop a new framework, named Alternating Differentiation (Alt-Diff), that differentiates optimization problems (here, specifically convex optimization problems with polyhedral constraints) in a fast and recursive way. Alt-Diff decouples the differentiation procedure into primal and dual updates performed in an alternating fashion. This substantially reduces the dimensions of the Jacobian matrix and thus significantly increases the speed of implicit differentiation. We further analyze the computational complexity of the forward and backward passes of Alt-Diff and show that the backward pass enjoys quadratic complexity. Another notable difference between Alt-Diff and state-of-the-art methods is that Alt-Diff can be truncated for the optimization layer. We show theoretically that: 1) Alt-Diff converges to the same gradients as those obtained by differentiating the KKT conditions; 2) the error between the gradient obtained by truncated Alt-Diff and by differentiating the KKT conditions is upper bounded by the same order as the truncation error of the variables. Therefore, Alt-Diff can be truncated to further increase computational speed without sacrificing much accuracy. A series of comprehensive experiments demonstrates that Alt-Diff yields results comparable to state-of-the-art methods in far less time.
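To make the alternating scheme concrete, the following is a minimal NumPy sketch that differentiates a parametrized quadratic program (minimize 0.5 x'Px + q'x subject to Ax = b, Gx <= h) with respect to its linear coefficient q, carrying the Jacobians of the primal, slack, and dual variables alongside ADMM-style updates. The particular splitting, the step size rho, the truncation criterion, and the name alt_diff_qp are illustrative assumptions made for this sketch, not the paper's reference implementation.

import numpy as np

def alt_diff_qp(P, q, A, b, G, h, rho=1.0, iters=500, tol=1e-8):
    """Solve the QP by ADMM-style updates and return (x, dx/dq)."""
    n, m, p = q.size, b.size, h.size
    x, s = np.zeros(n), np.zeros(p)          # primal and slack variables
    lam, nu = np.zeros(m), np.zeros(p)       # dual variables
    # Jacobians of (x, s, lam, nu) with respect to q, updated in lockstep.
    dx, ds = np.zeros((n, n)), np.zeros((p, n))
    dlam, dnu = np.zeros((m, n)), np.zeros((p, n))

    K = P + rho * (A.T @ A + G.T @ G)        # fixed matrix in the primal update
    K_inv = np.linalg.inv(K)                 # invert once, reuse every iteration

    for _ in range(iters):
        x_prev = x
        # Primal update (x-step of the augmented Lagrangian) and its Jacobian.
        rhs = -(q + A.T @ lam + G.T @ nu - rho * A.T @ b + rho * G.T @ (s - h))
        x = K_inv @ rhs
        dx = -K_inv @ (np.eye(n) + A.T @ dlam + G.T @ dnu + rho * G.T @ ds)
        # Slack update: projection onto the nonnegative orthant; its Jacobian
        # applies a 0/1 mask to the pre-projection Jacobian.
        pre = h - G @ x - nu / rho
        mask = (pre > 0).astype(float)[:, None]
        s = np.maximum(pre, 0.0)
        ds = mask * (-G @ dx - dnu / rho)
        # Dual updates and their Jacobians.
        lam = lam + rho * (A @ x - b)
        dlam = dlam + rho * (A @ dx)
        nu = nu + rho * (G @ x + s - h)
        dnu = dnu + rho * (G @ dx + ds)
        if np.linalg.norm(x - x_prev) < tol:  # truncation / stopping criterion
            break
    return x, dx

Because the matrix in the primal update does not change across iterations, it is inverted (or factorized) once and reused, and each Jacobian update only involves matrix products of moderate size rather than a full KKT system. Stopping the loop early gives the truncated variant discussed above, whose gradient error tracks the truncation error of the iterates.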
