Reparameterizing Mirror Descent as Gradient Descent

Most of the recent successful applications of neural networks have been based on training with gradient descent updates. However, for certain small networks, other updates from the mirror descent family provably learn more efficiently when the target is sparse. We present a general framework for casting a mirror descent update as a gradient descent update on a different set of parameters. In some cases, the mirror descent reparameterization can be described as training a modified network with standard backpropagation. The reparameterization framework is versatile and covers a wide range of mirror descent updates, including cases where the domain is constrained. Our reparameterization construction is given for the continuous-time versions of the updates. Finding general criteria under which the discrete versions closely track their continuous counterparts remains an interesting open problem.
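As an illustration of the kind of reparameterization the abstract refers to, a standard example is continuous-time unnormalized exponentiated gradient, whose flow w_dot = -eta * w * grad L(w) coincides with plain gradient descent on parameters u under the substitution w = u^2 / 4 (by the chain rule, dL/du = (u/2) * dL/dw). The sketch below is not taken from the paper; it assumes a simple quadratic loss and a small-step Euler discretization of both flows, and checks numerically that the two iterates stay close.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's setup):
# compare an Euler discretization of the unnormalized exponentiated
# gradient flow with gradient descent on the reparameterization w = u**2 / 4.

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((20, d))
target = np.abs(rng.standard_normal(d))          # nonnegative target
b = A @ target

def loss_grad(w):
    # Gradient of the squared loss 0.5 * ||A w - b||^2 with respect to w.
    return A.T @ (A @ w - b)

eta, steps = 1e-3, 20000
w_md = np.full(d, 0.1)                           # mirror descent (EGU) iterate
u = 2.0 * np.sqrt(w_md)                          # GD iterate, so that w = u**2 / 4

for _ in range(steps):
    w_md = w_md - eta * w_md * loss_grad(w_md)   # Euler step of the EGU flow
    w_gd = u**2 / 4.0
    u = u - eta * loss_grad(w_gd) * (u / 2.0)    # chain rule: dL/du = dL/dw * u/2

print("EGU iterate:        ", np.round(w_md, 4))
print("Reparameterized GD: ", np.round(u**2 / 4.0, 4))
print("max abs difference: ", np.abs(w_md - u**2 / 4.0).max())
```

With a sufficiently small step size the two trajectories approximate the same ODE, so the printed difference is small; as the abstract notes, general criteria for when the discrete updates track the continuous ones remain open.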
