Avoiding Degradation in Deep Feed-Forward Networks by Phasing Out Skip-Connections

A widely observed phenomenon in deep learning is the degradation problem: increasing the depth of a network leads to a decrease in performance on both test and training data. Novel architectures such as ResNets and Highway networks have addressed this issue by introducing various flavors of skip-connections or gating mechanisms. However, the degradation problem persists in the context of plain feed-forward networks. In this work we propose a simple method to address this issue. The proposed method poses the learning of weights in deep networks as a constrained optimization problem in which the presence of skip-connections is penalized by Lagrange multipliers. This allows skip-connections to be introduced during the early stages of training and subsequently phased out in a principled manner. We demonstrate the benefits of such an approach with experiments on MNIST, fashion-MNIST, CIFAR-10 and CIFAR-100, where the proposed method is shown to greatly reduce the degradation effect compared to plain networks and is often competitive with ResNets.
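The following is a minimal sketch of the idea described above, written in PyTorch: each block carries a learnable skip weight alpha, and an augmented-Lagrangian penalty with a per-block multiplier drives alpha towards zero, so the skip-connection is present early in training and phased out thereafter. The names (GatedSkipBlock, lambdas, rho) and the exact penalty and update schedule are illustrative assumptions, not the paper's precise formulation.

```python
# Sketch: phasing out skip-connections via Lagrange multipliers (assumed formulation).
import torch
import torch.nn as nn

class GatedSkipBlock(nn.Module):
    """Feed-forward block with a learnable skip weight alpha (starts at 1, i.e. full skip)."""
    def __init__(self, width):
        super().__init__()
        self.fc = nn.Linear(width, width)
        self.act = nn.ReLU()
        self.alpha = nn.Parameter(torch.ones(1))  # scales the identity path

    def forward(self, x):
        return self.act(self.fc(x)) + self.alpha * x

def augmented_lagrangian(blocks, lambdas, rho):
    """Penalty enforcing alpha_l = 0: sum_l lambda_l * alpha_l + (rho / 2) * alpha_l^2."""
    penalty = 0.0
    for block, lam in zip(blocks, lambdas):
        penalty = penalty + lam * block.alpha.sum() + 0.5 * rho * (block.alpha ** 2).sum()
    return penalty

if __name__ == "__main__":
    blocks = nn.ModuleList([GatedSkipBlock(16) for _ in range(10)])
    head = nn.Linear(16, 10)
    lambdas = [0.0] * len(blocks)   # one multiplier per block
    rho = 1e-2                      # penalty strength (assumed hyperparameter)
    opt = torch.optim.SGD(list(blocks.parameters()) + list(head.parameters()), lr=0.1)

    x, y = torch.randn(32, 16), torch.randint(0, 10, (32,))  # toy data
    for step in range(5):
        h = x
        for block in blocks:
            h = block(h)
        loss = nn.functional.cross_entropy(head(h), y) + augmented_lagrangian(blocks, lambdas, rho)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Dual ascent on the multipliers: pushes each alpha towards the constraint alpha = 0.
        lambdas = [lam + rho * block.alpha.item() for lam, block in zip(lambdas, blocks)]
```

In this sketch the primal step updates the network weights and the alphas against the penalized loss, while the dual step increases each multiplier in proportion to the remaining skip weight, so the skip path is gradually suppressed rather than removed abruptly.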
