Provably Correct Automatic Subdifferentiation for Qualified Programs

The \emph{Cheap Gradient Principle}~\citep{Griewank:2008:EDP:1455489} --- the computational cost of computing a $d$-dimensional vector of partial derivatives of a scalar function is nearly the same (often within a factor of $5$) as that of simply computing the scalar function itself --- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions, which are subsequently used in black-box gradient-based optimization procedures. The current state of affairs is markedly different with regard to computing sub-derivatives: widely used ML libraries, including TensorFlow and PyTorch, do \emph{not} correctly compute (generalized) sub-derivatives even on simple differentiable examples. This work considers the question: is there a \emph{Cheap Sub-gradient Principle}? Our main result shows that, under certain restrictions on our library of non-smooth functions (standard in non-linear programming), provably correct generalized sub-derivatives can be computed at a computational cost that is within a (dimension-free) factor of $6$ of the cost of computing the scalar function itself.
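
As a concrete illustration of the failure mode described above (an illustrative sketch, not necessarily the paper's own example): the function $f(x) = \mathrm{relu}(x) - \mathrm{relu}(-x)$ equals $x$ everywhere, so it is differentiable with $f'(0) = 1$; yet standard automatic differentiation, which propagates the convention $\mathrm{relu}'(0) = 0$ through the chain rule, reports a derivative of $0$ at the origin. A minimal PyTorch sketch:

```python
import torch

# f(x) = relu(x) - relu(-x) is identically equal to x, so f'(x) = 1 everywhere.
x = torch.tensor(0.0, requires_grad=True)
f = torch.relu(x) - torch.relu(-x)
f.backward()

# Autodiff applies relu'(0) = 0 in both branches and returns 0,
# which is not a valid (generalized) derivative of f at 0.
print(x.grad)  # tensor(0.)
```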

[1] J. Abadie. On the Kuhn-Tucker theorem. 1966.

[2] F. J. Gould et al. A necessary and sufficient qualification for constrained optimization. 1971.

[3] D. W. Peterson. A review of constraint qualifications in finite-dimensional spaces. 1973.

[4] F. Clarke. Generalized gradients and applications. 1975.

[5] J. Morgenstern et al. How to compute fast a function and all its derivatives: a variation on the theorem of Baur-Strassen. SIGACT News, 1985.

[6] James L. McClelland et al. Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations. 1986.

[7] Stephen Smale et al. On a theory of computation over the real numbers; NP completeness, recursive functions and universal machines. 29th Annual Symposium on Foundations of Computer Science, 1988.

[8] Griewank et al. On automatic differentiation. 1988.

[9] James Demmel et al. Applied Numerical Linear Algebra. 1997.

[10] Andreas Griewank et al. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, second edition. Frontiers in Applied Mathematics, 2000.

[11] M. Coste. An introduction to o-minimal geometry. 2002.

[12] O. Mangasarian. On concepts of directional differentiability. 2004.

[13] Yurii Nesterov. Lexicographic differentiation of nonsmooth functions. Math. Program., 2005.

[14] B. Mordukhovich. Variational Analysis and Generalized Differentiation II: Applications. 2006.

[15] Andreas Griewank et al. Who invented the reverse mode of differentiation? 2012.

[16] Paul I. Barton et al. Evaluating an element of the Clarke generalized Jacobian of a composite piecewise differentiable function. ACM TOMS, 2013.

[17] Andreas Griewank et al. On automatic differentiation and algorithmic linearization. 2014.

[18] Paul I. Barton et al. A vector forward mode of automatic differentiation for generalized derivative evaluation. Optim. Methods Softw., 2015.

[19] Luca Antiga et al. Automatic differentiation in PyTorch. 2017.

[20] Neil D. Lawrence et al. Auto-Differentiating Linear Algebra. arXiv, 2017.

[21] Stéphan Thomassé et al. On the complexity of partial derivatives. STACS, 2016.

[22] Barak A. Pearlmutter et al. Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res., 2015.

[23] Andreas Griewank et al. Algorithmic differentiation for piecewise smooth functions: a case study for robust optimization. Optim. Methods Softw., 2018.

[24] Kamil A. Khan. Branch-locking AD techniques for nonsmooth composite functions and nonsmooth implicit functions. Optim. Methods Softw., 2018.