Grammars for Games: A Gradient-Based, Game-Theoretic Framework for Optimization in Deep Learning

Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined (researchers "know them when they see them"), and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth- and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
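
As a rough illustration of the claim that backpropagation is a gradient computation distributed over a network, the sketch below treats each node as a player that sends its output value forward (zeroth-order information) and receives a gradient from downstream (first-order information). The names `Player` and `loss_and_grads`, the tanh units, and the squared loss are assumptions made for this example only; they are not taken from the essay.

```python
import math


class Player:
    """Hypothetical node holding one weight w and computing y = tanh(w * x)."""

    def __init__(self, w):
        self.w = w

    def forward(self, x):
        # Zeroth-order message: the output value passed downstream.
        self.x = x
        self.y = math.tanh(self.w * x)
        return self.y

    def backward(self, grad_y):
        # First-order message: gradient of the loss w.r.t. this node's
        # output, received from downstream.
        local = grad_y * (1.0 - self.y ** 2)  # d tanh(u)/du = 1 - tanh(u)^2
        self.grad_w = local * self.x          # gradient for the local weight
        return local * self.w                 # gradient passed upstream


def loss_and_grads(players, x, target):
    """Run a chain of players and return (loss, per-player weight gradients)."""
    h = x
    for p in players:                # forward sweep: values flow downstream
        h = p.forward(h)
    loss = 0.5 * (h - target) ** 2   # squared loss, chosen for illustration
    g = h - target                   # d loss / d output
    for p in reversed(players):      # backward sweep: gradients flow upstream
        g = p.backward(g)
    return loss, [p.grad_w for p in players]


players = [Player(0.5), Player(-1.2), Player(0.8)]
loss, grads = loss_and_grads(players, x=0.7, target=0.3)
```

Each player computes its own weight gradient using only the messages it receives from its neighbors, which is the sense in which the computation is distributed.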
