Regularizing Neural Networks via Stochastic Branch Layers

We introduce a novel stochastic regularization technique for deep neural networks that decomposes a layer into multiple branches with different parameters and merges stochastically sampled combinations of the branch outputs during training. Since the factorized branches collapse into a single branch through a linear operation, inference incurs no additional complexity compared to an ordinary layer. The proposed method, referred to as StochasticBranch, is applicable to any linear layer, such as fully-connected or convolutional layers. The regularizer lets the model explore diverse regions of the parameter space via multiple combinations of branches and thereby find better local minima. An extensive set of experiments shows that our method effectively regularizes networks and further improves generalization when combined with other existing regularization techniques.
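Below is a minimal PyTorch-style sketch of how a branched linear layer of this kind could look. It is an illustration under stated assumptions, not the authors' exact formulation: the class name `StochasticBranchLinear`, the per-branch Bernoulli gates, the `keep_prob` parameter, and the weight-averaging collapse rule are all assumptions chosen so that the expected training output matches the collapsed inference layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticBranchLinear(nn.Module):
    """Sketch of a linear layer split into K branches whose outputs are
    merged with stochastically sampled gates during training.

    At inference the branches collapse into a single linear map by
    averaging their weights (valid because the merge is linear), so the
    forward pass costs the same as an ordinary nn.Linear.
    """

    def __init__(self, in_features, out_features, num_branches=4, keep_prob=0.5):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Linear(in_features, out_features) for _ in range(num_branches)
        )
        self.num_branches = num_branches
        self.keep_prob = keep_prob  # assumed Bernoulli gate probability

    def forward(self, x):
        if self.training:
            # Compute all branch outputs: shape (K, N, out_features).
            outputs = torch.stack([b(x) for b in self.branches], dim=0)
            # Sample an independent Bernoulli gate per branch and example.
            gates = torch.bernoulli(
                torch.full((self.num_branches, x.size(0), 1),
                           self.keep_prob, device=x.device)
            )
            # Merge gated branch outputs; the normalization keeps the
            # expected output equal to the collapsed layer used at test time.
            return (gates * outputs).sum(dim=0) / (self.num_branches * self.keep_prob)
        # Inference: collapse the branches into one weight/bias pair.
        weight = torch.stack([b.weight for b in self.branches]).mean(dim=0)
        bias = torch.stack([b.bias for b in self.branches]).mean(dim=0)
        return F.linear(x, weight, bias)
```

In this sketch the training-time division by `num_branches * keep_prob` plays the same role as inverted-dropout scaling: the expectation of the stochastically merged output equals the output of the averaged (collapsed) weights, so switching to evaluation mode needs no extra computation beyond a single linear layer.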
