Avoiding Degradation in Deep Feed-Forward Networks by Phasing Out Skip-Connections

A widely observed phenomenon in deep learning is the degradation problem: increasing the depth of a network leads to a decrease in performance on both test and training data. Novel architectures such as ResNets and Highway networks have addressed this issue by introducing various flavors of skip-connections or gating mechanisms. However, the degradation problem persists in the context of plain feed-forward networks. In this work we propose a simple method to address this issue. The proposed method poses the learning of weights in deep networks as a constrained optimization problem in which the presence of skip-connections is penalized by Lagrange multipliers. This allows skip-connections to be introduced during the early stages of training and subsequently phased out in a principled manner. We demonstrate the benefits of such an approach with experiments on MNIST, fashion-MNIST, CIFAR-10 and CIFAR-100, where the proposed method is shown to greatly reduce the degradation effect compared to plain networks and is often competitive with ResNets.
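The following is a minimal sketch of the idea described above, written in PyTorch: each block carries a learnable skip weight alpha, and an augmented-Lagrangian penalty with a per-block multiplier drives alpha towards zero, so the skip-connection is present early in training and phased out thereafter. The names (GatedSkipBlock, lambdas, rho) and the exact penalty and update schedule are illustrative assumptions, not the paper's precise formulation.

```python
# Sketch: phasing out skip-connections via Lagrange multipliers (assumed formulation).
import torch
import torch.nn as nn

class GatedSkipBlock(nn.Module):
    """Feed-forward block with a learnable skip weight alpha (starts at 1, i.e. full skip)."""
    def __init__(self, width):
        super().__init__()
        self.fc = nn.Linear(width, width)
        self.act = nn.ReLU()
        self.alpha = nn.Parameter(torch.ones(1))  # scales the identity path

    def forward(self, x):
        return self.act(self.fc(x)) + self.alpha * x

def augmented_lagrangian(blocks, lambdas, rho):
    """Penalty enforcing alpha_l = 0: sum_l lambda_l * alpha_l + (rho / 2) * alpha_l^2."""
    penalty = 0.0
    for block, lam in zip(blocks, lambdas):
        penalty = penalty + lam * block.alpha.sum() + 0.5 * rho * (block.alpha ** 2).sum()
    return penalty

if __name__ == "__main__":
    blocks = nn.ModuleList([GatedSkipBlock(16) for _ in range(10)])
    head = nn.Linear(16, 10)
    lambdas = [0.0] * len(blocks)   # one multiplier per block
    rho = 1e-2                      # penalty strength (assumed hyperparameter)
    opt = torch.optim.SGD(list(blocks.parameters()) + list(head.parameters()), lr=0.1)

    x, y = torch.randn(32, 16), torch.randint(0, 10, (32,))  # toy data
    for step in range(5):
        h = x
        for block in blocks:
            h = block(h)
        loss = nn.functional.cross_entropy(head(h), y) + augmented_lagrangian(blocks, lambdas, rho)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Dual ascent on the multipliers: pushes each alpha towards the constraint alpha = 0.
        lambdas = [lam + rho * block.alpha.item() for lam, block in zip(lambdas, blocks)]
```

In this sketch the primal step updates the network weights and the alphas against the penalized loss, while the dual step increases each multiplier in proportion to the remaining skip weight, so the skip path is gradually suppressed rather than removed abruptly.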
