ADCI-Net: an adaptive discriminative clip identification strategy for fast video action recognition

Abstract. The most common approach to video-level classification is to run a model over every fixed-temporal-length clip in a long video. However, videos usually contain uninformative sections and near-duplicate clips, and feeding all of them to the model can both degrade recognition accuracy and waste computing resources. To address this issue, we introduce an adaptive discriminative clip identification network that scores every video clip by its relevance, adaptively selects the top-ranked clips as inputs for prediction, and filters out irrelevant clips. Specifically, given a trained, cumbersome convolutional neural network action recognition model, we use distillation to train a lightweight hallucination network (H-Net) that mimics its generalization ability. Clip relevance is then estimated from the imitated features of H-Net, so heavy computation on overlapping or uninformative clips is avoided at inference time. We validate our approach on two datasets, UCF101 and HMDB51. Our strategy can be applied to any clip-based action recognition model. Experimental results demonstrate that on UCF101 it reduces the computational cost by 70% while increasing accuracy.
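The selection step the abstract describes, scoring every clip with a lightweight network and keeping only the top-ranked ones for the heavy classifier, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `clip_scores` stands in for the relevance values produced from H-Net's imitated features, and the `keep_ratio` parameter is an assumed knob (a ratio around 0.3 would be consistent with the reported ~70% compute reduction on UCF101).

```python
def select_top_clips(clip_scores, keep_ratio=0.3):
    """Return indices of the highest-scoring clips, in temporal order.

    clip_scores -- per-clip relevance scores from the lightweight scorer
                   (hypothetical stand-in for H-Net's output).
    keep_ratio  -- fraction of clips to keep; an assumed parameter.
    """
    k = max(1, round(len(clip_scores) * keep_ratio))
    # Rank clip indices by score, highest first.
    ranked = sorted(range(len(clip_scores)),
                    key=lambda i: clip_scores[i], reverse=True)
    # Keep the top k and restore temporal order for the classifier.
    return sorted(ranked[:k])


# Example: 10 clips, of which only 3 are discriminative.
scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1]
print(select_top_clips(scores))  # -> [1, 4, 7]
```

Only the selected clips would then be passed to the cumbersome recognition model, which is where the inference-time savings come from.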
