Compressed Video Action Recognition

Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by that the superfluous information can be reduced by up to two orders of magnitude by video compression (using H.264, HEVC, etc.), we propose to train a deep network directly on the compressed video. This representation has a higher information density, and we found the training to be easier. In addition, the signals in a compressed video provide free, albeit noisy, motion information. We propose novel techniques to use them effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times faster than ResNet-152. On the task of action recognition, our approach outperforms all the other methods on the UCF-101, HMDB-51, and Charades dataset.

[1]  Didier Le Gall,et al.  MPEG: a video compression standard for multimedia applications , 1991, CACM.

[2]  Boon-Lock Yeo,et al.  Rapid scene analysis on compressed video , 1995, IEEE Trans. Circuits Syst. Video Technol..

[3]  O. Sukmarg,et al.  Fast object detection and segmentation in MPEG compressed domain , 2000, 2000 TENCON Proceedings. Intelligent Systems and Technologies for the New Millennium (Cat. No.00CH37119).

[4]  Iain E. G. Richardson,et al.  Video Codec Design: Developing Image and Video Compression Systems , 2002 .

[5]  A. Enis Çetin,et al.  Moving object detection in wavelet compressed video , 2005, Signal Process. Image Commun..

[6]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[7]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[8]  Jan-Michael Frahm,et al.  Detailed Real-Time Urban 3D Reconstruction from Video , 2007, International Journal of Computer Vision.

[9]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[11]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[12]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[13]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[14]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[15]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[16]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[17]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[18]  Ivan Laptev,et al.  Efficient Feature Extraction, Encoding, and Classification for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[20]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[22]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[24]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[25]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[27]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[29]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[30]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[31]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Koray Kavukcuoglu,et al.  Pixel Recurrent Neural Networks , 2016, ICML.

[34]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[35]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[36]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[38]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[39]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Luc Van Gool,et al.  Deep Temporal Linear Encoding Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Deva Ramanan,et al.  Attentional Pooling for Action Recognition , 2017, NIPS.

[44]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[45]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Wei Zeng,et al.  Learning Long-Term Dependencies for Action Recognition with a Biologically-Inspired Deep Network , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Ali Farhadi,et al.  Asynchronous Temporal Fields for Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Philip S. Yu,et al.  Spatiotemporal Pyramid Network for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Lin Sun,et al.  Lattice Long Short-Term Memory for Human Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Ghassan Al-Regib,et al.  TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition , 2017, Signal Process. Image Commun..