Design Light-weight 3D Convolutional Networks for Video Recognition Temporal Residual, Fully Separable Block, and Fast Algorithm

Deep 3-dimensional (3D) Convolutional Network (ConvNet) has shown promising performance on video recognition tasks because of its powerful spatio-temporal information fusion ability. However, the extremely intensive requirements on memory access and computing power prohibit it from being used in resource-constrained scenarios, such as portable and edge devices. So in this paper, we first propose a two-stage Fully Separable Block (FSB) to significantly compress the model sizes of 3D ConvNets. Then a feature enhancement approach named Temporal Residual Gradient (TRG) is developed to improve the performance of compressed model on video tasks, which provides higher accuracy, faster convergency and better robustness. Moreover, in order to further decrease the computing workload, we propose a hybrid Fast Algorithm (hFA) to drastically reduce the computation complexity of convolutions. These methods are effectively combined to design a light-weight and efficient ConvNet for video recognition tasks. Experiments on the popular dataset report 2.3x compression rate, 3.6x workload reduction, and 6.3% top-1 accuracy gain, over the state-of-the-art SlowFast model, which is already a highly compact model. The proposed methods also show good adaptability on traditional 3D ConvNet, demonstrating 7.4x more compact model, 11.0x less workload, and 3.0% higher accuracy

[1]  Jeff Johnson,et al.  Fast Convolutional Nets With fbfft: A GPU Performance Evaluation , 2014, ICLR.

[2]  David Fernández Llorca,et al.  Vehicle logo recognition in traffic images using HOG features and SVM , 2013, 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013).

[3]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[4]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[5]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Zelong Wang,et al.  Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA , 2018, FPGA.

[8]  Zhongfeng Wang,et al.  Efficient convolution architectures for convolutional neural network , 2016, 2016 8th International Conference on Wireless Communications & Signal Processing (WCSP).

[9]  Zhongfeng Wang,et al.  Efficient Reconfigurable Hardware Core for Convolutional Neural Networks , 2018, 2018 52nd Asilomar Conference on Signals, Systems, and Computers.

[10]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[11]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Z. Mou,et al.  Fast FIR filtering: algorithms and implementations , 1987 .

[13]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[14]  Simon Haykin,et al.  GradientBased Learning Applied to Document Recognition , 2001 .

[15]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[16]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Andrew Lavin,et al.  Fast Algorithms for Convolutional Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[19]  Xuelong Li,et al.  Efficient HOG human detection , 2011, Signal Process..

[20]  Zhongfeng Wang,et al.  SGAD: Soft-Guided Adaptively-Dropped Neural Network , 2018, ArXiv.

[21]  Forrest N. Iandola,et al.  SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.

[22]  Christian Wolf,et al.  Action Classification in Soccer Videos with Long Short-Term Memory Recurrent Neural Networks , 2010, ICANN.

[23]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[24]  Yann LeCun,et al.  Fast Training of Convolutional Networks through FFTs , 2013, ICLR.

[25]  Xiaoyan Sun,et al.  MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[29]  S. Winograd Arithmetic complexity of computations , 1980 .

[30]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).