Efficient Video Classification Using Fewer Frames

Recently, there has been much interest in building compact models for video classification that have a small memory footprint (< 1 GB). While these models are compact, they typically operate by repeatedly applying a small weight matrix to every frame in a video. For example, recurrent neural network based methods compute a hidden state for every frame of the video using a recurrent weight matrix. Similarly, cluster-and-aggregate based methods such as NetVLAD have a learnable clustering matrix that is used to assign soft clusters to every frame in the video. Because these models look at every frame, their number of floating point operations (FLOPs) remains large even though their memory footprint is small. In this work, we focus on building compute-efficient video classification models that process fewer frames and hence require fewer FLOPs. Like memory-efficient models, we use the idea of distillation, albeit in a different setting. Specifically, in our case, a compute-heavy teacher that looks at all the frames in the video is used to train a compute-efficient student that looks at only a small fraction of the frames. This is in contrast to the typical memory-efficient teacher-student setting, in which both the teacher and the student look at all the frames but the student has fewer parameters. Our work thus complements research on memory-efficient video classification. We perform an extensive evaluation with three types of models for video classification, viz., (i) recurrent models, (ii) cluster-and-aggregate models, and (iii) memory-efficient cluster-and-aggregate models, and show that in each case a see-it-all teacher can be used to train a compute-efficient see-very-little student. Overall, we show that the proposed student network can reduce inference time by 30% and the number of FLOPs by approximately 90% with a negligible drop in performance.
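To make the see-it-all teacher / see-very-little student setup concrete, the following is a minimal PyTorch sketch, not the paper's actual code: the module names, dimensions, frame-subsampling rule, and the MSE-on-probabilities distillation term are all illustrative assumptions. The teacher reads every frame; the student reads only every k-th frame and is trained with a supervised loss plus a term pulling its predictions toward the teacher's soft targets.

```python
# Minimal sketch (assumptions: GRU backbones, every-k-th-frame sampling,
# MSE between sigmoid outputs as the distillation term).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameClassifier(nn.Module):
    """GRU over frame features followed by a multi-label classifier head."""
    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=3862):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):          # frames: (batch, time, feat_dim)
        _, h = self.rnn(frames)         # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))  # logits: (batch, num_classes)

teacher = FrameClassifier()  # sees all frames; assumed pretrained and frozen
student = FrameClassifier()  # sees only a fraction of the frames
for p in teacher.parameters():
    p.requires_grad_(False)

def student_loss(frames, labels, keep_every=10, alpha=0.5):
    """Supervised loss on the student's own predictions, plus a
    distillation term matching the frozen teacher's soft targets."""
    with torch.no_grad():
        teacher_logits = teacher(frames)               # all frames
    student_logits = student(frames[:, ::keep_every])  # every 10th frame
    ce = F.binary_cross_entropy_with_logits(student_logits, labels)
    kd = F.mse_loss(torch.sigmoid(student_logits),
                    torch.sigmoid(teacher_logits))
    return alpha * ce + (1 - alpha) * kd

# Toy usage: 8 videos of 300 frames, 1024-d features, multi-label targets.
frames = torch.randn(8, 300, 1024)
labels = torch.randint(0, 2, (8, 3862)).float()
loss = student_loss(frames, labels)
loss.backward()
```

With keep_every=10, the student's recurrent computation touches roughly a tenth of the frames, which is where the FLOP reduction comes from; the teacher's full pass is needed only at training time, so inference uses the student alone.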
