The simple essence of automatic differentiation

Automatic differentiation (AD) in reverse mode (RAD) is a central component of deep learning and other uses of large-scale optimization. Commonly used RAD algorithms such as backpropagation, however, are complex and stateful, hindering deep understanding, improvement, and parallel execution. This paper develops a simple, generalized AD algorithm calculated from a simple, natural specification. The general algorithm is then specialized by varying the representation of derivatives. In particular, applying well-known constructions to a naive representation yields two RAD algorithms that are far simpler than previously known. In contrast to commonly used RAD implementations, the algorithms defined here involve no graphs, tapes, variables, partial derivatives, or mutation. They are inherently parallel-friendly, correct by construction, and usable directly from an existing programming language with no need for new data types or programming style, thanks to use of an AD-agnostic compiler plugin.
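
To make the abstract's claim concrete, here is a minimal Haskell sketch of the naive representation the paper starts from: a differentiable function paired with its derivative, composed via the chain rule. Representing linear maps as plain functions, and the helper names `compD` and `sqr`, are simplifications for illustration; the paper generalizes this representation categorically, and varying how the linear maps are represented yields the forward- and reverse-mode algorithms.

```haskell
-- Naive representation: a function together with its derivative,
-- where the derivative at a point is a linear map (here, a plain function).
newtype D a b = D (a -> (b, a -> b))

-- Sequential composition is the chain rule: run f, then g,
-- and compose the two derivative maps.
compD :: D b c -> D a b -> D a c
compD (D g) (D f) = D (\a ->
  let (b, f') = f a
      (c, g') = g b
  in (c, g' . f'))

-- A linear function is its own derivative everywhere.
linearD :: (a -> b) -> D a b
linearD f = D (\a -> (f a, f))

-- Example: squaring, whose derivative at x is the map dx -> 2*x*dx.
sqr :: D Double Double
sqr = D (\x -> (x * x, \dx -> 2 * x * dx))

main :: IO ()
main = do
  let D h = sqr `compD` sqr   -- x^4, by composition
      (y, y') = h 3
  print y        -- 81.0
  print (y' 1)   -- derivative of x^4 at 3: 108.0
```

Note that nothing here is stateful: no tape, no graph, no mutation. Per the paper, swapping in other representations of the linear-map component (for instance, dualized or continuation-based ones) is what specializes this single construction to reverse mode.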
