Deep networks with probabilistic gates

We investigate learning to probabilistically bypass computations in a deep network. Our approach is motivated by AIG [7], in which layers are conditionally executed depending on their inputs and the network is trained against a target bypass rate using a per-layer loss. We propose a per-batch loss function and describe strategies for handling probabilistic bypass during both training and inference. The per-batch loss gives the network additional flexibility; in particular, a form of mode collapse becomes plausible, in which some layers are almost always bypassed and others almost never, a configuration strongly discouraged by AIG’s per-layer loss. We explore several inference-time strategies, including the natural MAP approach. With data-dependent bypass, we demonstrate improved performance over AIG; with data-independent bypass, as in stochastic depth [23], we observe mode collapse and effectively prune layers. We demonstrate our techniques on ResNet-50 and ResNet-101 [4] for ImageNet [18], where they improve accuracy by 0.15–0.41% in precision@1 while using substantially less computation (bypassing 25–40% of the layers).
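To make the described mechanism concrete, the sketch below (PyTorch, not the authors' implementation) shows one way a residual branch can be wrapped with a probabilistic, data-dependent execute/bypass gate sampled via the straight-through Gumbel-Softmax relaxation [22, 41], together with a per-batch loss that constrains the average execution rate across layers rather than each layer individually, and a MAP rule at inference time. The names GatedResidualBlock, per_batch_gate_loss, and target_exec_rate, as well as the pooled-feature gating head, are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch, assuming a standard PyTorch ResNet-style residual branch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedResidualBlock(nn.Module):
    """Wraps a residual branch F(x) with a probabilistic execute/bypass gate."""

    def __init__(self, branch: nn.Module, channels: int):
        super().__init__()
        self.branch = branch                      # the residual branch F(x)
        self.gate = nn.Linear(channels, 2)        # gating head: [bypass, execute] logits

    def forward(self, x):
        logits = self.gate(x.mean(dim=(2, 3)))    # global average pool -> (B, 2)
        if self.training:
            # Differentiable hard sample via straight-through Gumbel-Softmax [22, 41].
            g = F.gumbel_softmax(logits, tau=1.0, hard=True)[:, 1]
        else:
            # MAP inference: execute the branch iff p(execute) > p(bypass).
            g = (logits[:, 1] > logits[:, 0]).float()
        g = g.view(-1, 1, 1, 1)
        out = x + g * self.branch(x)              # gate closed -> identity bypass
        return out, g.mean()                      # this layer's batch execution rate


def per_batch_gate_loss(rates, target_exec_rate=0.7):
    """Per-batch loss: penalize the average execution rate over all gated layers
    in the batch, instead of constraining every layer separately as in AIG's
    per-layer loss. This leaves room for 'mode collapse', where some layers are
    executed almost always and others almost never."""
    mean_rate = torch.stack(list(rates)).mean()
    return (mean_rate - target_exec_rate) ** 2
```

Note that in this toy form the residual branch is still evaluated even when its gate closes; realizing actual compute savings requires skipping the branch when the gate is zero, which is straightforward for data-independent gates (effectively pruned layers) and for batch-size-1 inference with data-dependent gates.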

[1] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.

[2] Mark Sandler, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3] Gang Hua, et al. A convolutional neural network cascade for face detection, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] R. Venkatesh Babu, et al. Learning Neural Network Architectures using Backpropagation, 2015, BMVC.

[6] Leonidas J. Guibas, et al. Taskonomy: Disentangling Task Transfer Learning, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7] Serge Belongie, et al. Convolutional Networks with Adaptive Inference Graphs, 2019, International Journal of Computer Vision.

[8] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[9] E. Gumbel. Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures, 1954.

[10] Pietro Perona, et al. Deciding How to Decide: Dynamic Routing in Artificial Neural Networks, 2017, ICML.

[11] Max Welling, et al. Variational Dropout and the Local Reparameterization Trick, 2015, NIPS.

[12] Li Zhang, et al. Spatially Adaptive Computation Time for Residual Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.

[14] Yoshua Bengio, et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, 2013, ArXiv.

[15] Yoshua Bengio, et al. Deep Learning of Representations: Looking Forward, 2013, SLSP.

[16] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[17] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[18] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19] R. Venkatesh Babu, et al. Training Sparse Neural Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[20] Max Welling, et al. Learning Sparse Neural Networks through L0 Regularization, 2017, ICLR.

[21] Luca Zappella, et al. Principal Filter Analysis for Guided Network Compression, 2018, ArXiv.

[22] Ben Poole, et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.

[23] Kilian Q. Weinberger, et al. Deep Networks with Stochastic Depth, 2016, ECCV.

[24] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.

[25] Hanan Samet, et al. Pruning Filters for Efficient ConvNets, 2016, ICLR.

[26] Fan Yang, et al. Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Rui Peng, et al. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures, 2016, ArXiv.

[28] Alex Graves, et al. Adaptive Computation Time for Recurrent Neural Networks, 2016, ArXiv.

[29] Yann LeCun, et al. Optimal Brain Damage, 1989, NIPS.

[30] Xiangyu Zhang, et al. Channel Pruning for Accelerating Very Deep Neural Networks, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31] R. Venkatesh Babu, et al. Generalized Dropout, 2016, ArXiv.

[32] Jianxin Wu, et al. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33] Paul A. Viola, et al. Robust Real-Time Face Detection, 2001, Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001).

[34] Timo Aila, et al. Pruning Convolutional Neural Networks for Resource Efficient Inference, 2016, ICLR.

[35] Alex Kendall, et al. Concrete Dropout, 2017, NIPS.

[36] Dmitry P. Vetrov, et al. Variational Dropout Sparsifies Deep Neural Networks, 2017, ICML.

[37] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, 1992, NIPS.

[38] Kilian Q. Weinberger, et al. Snapshot Ensembles: Train 1, get M for free, 2017, ICLR.

[39] Michael C. Mozer, et al. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment, 1988, NIPS.

[40] Mingjie Sun, et al. Rethinking the Value of Network Pruning, 2018, ICLR.

[41] Yee Whye Teh, et al. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, 2016, ICLR.

[42] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[43] H. T. Kung, et al. BranchyNet: Fast inference via early exiting from deep neural networks, 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[44] Naiyan Wang, et al. Data-Driven Sparse Structure Selection for Deep Neural Networks, 2017, ECCV.