Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group

We introduce a novel approach to performing first-order optimization under orthogonal and unitary constraints. The approach is based on a parametrization stemming from Lie group theory: the exponential map. This parametrization transforms the constrained optimization problem into an unconstrained one over a Euclidean space, to which common first-order optimization methods apply directly. The theoretical results presented are general enough to cover the special orthogonal group, the unitary group, and, more generally, any connected compact Lie group. We also discuss how this and other parametrizations can be computed efficiently through an implementation trick, making numerically complex parametrizations usable at negligible runtime cost in neural networks. In particular, we apply our results to RNNs with orthogonal recurrent weights, yielding a new architecture called expRNN. We demonstrate that our method constitutes a more robust approach to optimization with orthogonal constraints, showing faster, more accurate, and more stable convergence on several tasks designed to stress-test RNNs.
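To make the parametrization concrete, here is a minimal sketch in PyTorch (assuming torch.matrix_exp is available, i.e., PyTorch 1.7 or later; the class name and structure are illustrative and not the authors' released code). An unconstrained square matrix A is mapped to the skew-symmetric matrix A - A^T, an element of the Lie algebra so(n), and the matrix exponential sends it to an orthogonal matrix with determinant 1, an element of SO(n). Since gradients flow through the exponential, the unconstrained parameter can be trained with any standard first-order optimizer.

```python
import torch

class ExpSO(torch.nn.Module):
    """Illustrative parametrization of SO(n) via the exponential map.

    An unconstrained n x n parameter A is projected onto the Lie
    algebra so(n) (skew-symmetric matrices); torch.matrix_exp then
    maps it onto the special orthogonal group SO(n).
    """
    def __init__(self, n):
        super().__init__()
        # Euclidean parameter: no constraint is ever imposed on A itself.
        self.A = torch.nn.Parameter(torch.zeros(n, n))

    def forward(self):
        skew = self.A - self.A.t()      # A - A^T is skew-symmetric
        return torch.matrix_exp(skew)   # exp: so(n) -> SO(n)

# Quick check: the returned matrix is orthogonal, and the parameter A
# can be optimized with any first-order method (SGD, Adam, ...).
param = ExpSO(4)
Q = param()
assert torch.allclose(Q @ Q.t(), torch.eye(4), atol=1e-5)
```

For the unitary group, the same construction would use a complex parameter projected onto the skew-Hermitian matrices (A - A^H) together with a complex matrix exponential.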
