Rate-Accuracy Trade-Off in Video Classification With Deep Convolutional Neural Networks

Advanced video classification systems decode video frames to derive texture and motion representations for ingestion and analysis by spatio-temporal deep convolutional neural networks (CNNs). However, in visual Internet-of-Things applications, surveillance systems, and semantic crawlers of large video repositories, the video capture and the CNN-based semantic analysis parts do not tend to be co-located. This necessitates the transport of compressed video over networks and incurs significant overhead in bandwidth and energy consumption, thereby undermining the deployment potential of such systems. In this paper, we investigate the trade-off between the encoding bitrate and the achievable accuracy of CNN-based video classification models that directly ingest AVC/H.264 and HEVC encoded videos. Instead of retaining entire compressed video bitstreams and applying complex optical flow calculations prior to CNN processing, we retain only motion vector and selected texture information at significantly reduced bitrates and apply no additional processing prior to CNN ingestion. Based on three CNN architectures and two action recognition datasets, we achieve 11%-94% savings in bitrate with marginal effect on classification accuracy. A model-based selection between multiple CNNs increases these savings further, to the point where, if up to 7% loss of accuracy can be tolerated, video classification can take place with as little as 3 kb/s for the transport of the required compressed video information to the system implementing the CNN models.
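The model-based selection described above can be viewed as a simple constrained optimization: among the available CNN/bitstream operating points, choose the one with the lowest bitrate whose accuracy remains within a tolerated drop from the best achievable accuracy. The sketch below illustrates this selection rule; the operating points (model names, bitrates, accuracies) are hypothetical placeholders, not results from the paper.

```python
# Sketch of rate-accuracy model selection: among several CNN operating
# points, pick the lowest-bitrate one whose accuracy loss (relative to
# the most accurate point) stays within a tolerated drop. The numbers
# below are illustrative assumptions, not measurements from the paper.

def select_operating_point(points, max_accuracy_drop):
    """points: list of (name, bitrate_kbps, accuracy) tuples.
    Returns the lowest-bitrate point whose accuracy is within
    max_accuracy_drop of the best accuracy among all points."""
    best_acc = max(acc for _, _, acc in points)
    feasible = [p for p in points if best_acc - p[2] <= max_accuracy_drop]
    return min(feasible, key=lambda p: p[1])

# Hypothetical operating points: (model, bitrate in kb/s, accuracy)
points = [
    ("3D-CNN, full texture",    300.0, 0.90),
    ("2-stream, MV + texture",   35.0, 0.88),
    ("MV-only CNN",               3.0, 0.84),
]

# With a 7% tolerated accuracy drop, the cheapest (3 kb/s) point wins.
print(select_operating_point(points, max_accuracy_drop=0.07))
```

With a tight tolerance the selector falls back to the most accurate (and most expensive) model, mirroring how the paper's savings grow as the tolerated accuracy loss increases.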
