Optimization on Submanifolds of Convolution Kernels in CNNs

Kernel normalization methods have been employed to improve the robustness of optimization methods to reparametrization of convolution kernels and to covariate shift, and to accelerate the training of Convolutional Neural Networks (CNNs). However, our understanding of the theoretical properties of these methods has lagged behind their success in applications. We develop a geometric framework that elucidates the mechanisms underlying a diverse range of kernel normalization methods. The framework enables us to identify and characterize the geometry of the spaces of normalized kernels. We analyze how state-of-the-art kernel normalization methods shape the geometry of the search spaces explored by stochastic gradient descent (SGD) algorithms in CNNs. Building on these theoretical results, we propose an SGD algorithm with an assurance of almost sure convergence to a solution at a minimum of the classification loss of CNNs. Experimental results show that the proposed method achieves state-of-the-art performance on major image classification benchmarks with CNNs.
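
To make the submanifold perspective concrete, the sketch below illustrates a single step of SGD for a convolution kernel constrained to the unit sphere, one of the simplest search spaces induced by kernel normalization: the Euclidean gradient is projected onto the tangent space at the current kernel, a step is taken in that direction, and the result is retracted back to the sphere by renormalization. This is a minimal illustrative sketch, not the algorithm proposed in the paper; the function name sgd_step_on_sphere, the NumPy setup, and the choice of the unit sphere as the constraint set are assumptions made for exposition.

    import numpy as np

    def sgd_step_on_sphere(w, grad, lr=0.1):
        """One SGD step for a kernel constrained to the unit sphere ||w|| = 1.

        w    : flattened convolution kernel with unit Euclidean norm
        grad : stochastic Euclidean gradient of the loss with respect to w
        lr   : step size
        """
        # Project the Euclidean gradient onto the tangent space at w,
        # i.e. remove the component of the gradient along w itself.
        riem_grad = grad - np.dot(w, grad) * w
        # Move along the negative tangent direction.
        w_new = w - lr * riem_grad
        # Retract back onto the sphere by renormalizing.
        return w_new / np.linalg.norm(w_new)

    # Example: one update of a random 3x3 kernel, flattened and normalized.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(9)
    w /= np.linalg.norm(w)
    grad = rng.standard_normal(9)   # stand-in for a minibatch gradient
    w = sgd_step_on_sphere(w, grad, lr=0.1)
    print(np.linalg.norm(w))        # remains 1 up to floating-point error

Because the update never leaves the sphere, the unit-norm constraint imposed by the normalization is maintained throughout training rather than being enforced only approximately by a penalty term.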
