ResNet and Batch-normalization Improve Data Separability

The skip-connection and batch normalization (BN) in ResNet enable extremely deep neural networks to be trained with high performance. However, the reasons for this high performance remain unclear. To clarify them, we study how the skip-connection and BN affect the propagation of class-related signals through the hidden layers, because a large ratio of the between-class distance to the within-class distance of the feature vectors at the last hidden layer leads to high performance. Our results show that the between-class distance and the within-class distance change differently through the layers: a deep multilayer perceptron with randomly initialized weights degrades the ratio of the between-class distance to the within-class distance, whereas the skip-connection and BN relax this degradation. Moreover, our analysis implies that the skip-connection and BN encourage training to improve this distance ratio. Together, these results suggest that the skip-connection and BN contribute to the high performance.
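
Below is a minimal sketch, not the authors' code, of the quantity the abstract describes: the ratio of the between-class distance to the within-class distance of hidden-layer features, measured at the output of a randomly initialized plain MLP and of a ResNet-style stack with skip-connections and BN. The layer width, depth, two-class Gaussian toy data, and He-style initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_ratio(features, labels):
    """Between-class distance divided by the mean within-class distance."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    between = np.linalg.norm(means[0] - means[1])
    within = np.mean([
        np.linalg.norm(features[labels == c] - means[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    return between / within

def batch_norm(h, eps=1e-5):
    # Normalize each feature over the batch (no learned scale/shift for simplicity).
    return (h - h.mean(axis=0)) / (h.std(axis=0) + eps)

def plain_layer(h, W):
    return np.maximum(0.0, h @ W)                  # ReLU MLP layer

def residual_bn_layer(h, W):
    return h + np.maximum(0.0, batch_norm(h) @ W)  # skip-connection + BN

# Toy two-class data: Gaussian blobs in 64 dimensions (an assumption for illustration).
d, n = 64, 200
x = np.vstack([rng.normal(-1.0, 1.0, (n, d)), rng.normal(1.0, 1.0, (n, d))])
y = np.array([0] * n + [1] * n)

h_plain, h_res = x.copy(), x.copy()
for _ in range(20):                                # 20 randomly initialized layers
    W = rng.normal(0.0, np.sqrt(2.0 / d), (d, d))  # He-style initialization
    h_plain = plain_layer(h_plain, W)
    h_res = residual_bn_layer(h_res, W)

print("input ratio:    ", distance_ratio(x, y))
print("plain MLP ratio:", distance_ratio(h_plain, y))
print("ResNet+BN ratio:", distance_ratio(h_res, y))
```

Comparing the printed ratios before training gives a rough sense of the degradation effect studied in the paper: the plain randomly initialized network tends to shrink the ratio through depth, while the skip-connection and BN variant preserves it better.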
