A Slow-I-Fast-P Architecture for Compressed Video Action Recognition

Compressed video action recognition has drawn growing attention owing to the storage and processing advantages of compressed videos over raw videos. While the past few years have witnessed remarkable progress on this problem, most existing approaches still rely on RGB frames decoded from raw videos and require multi-step training. In this paper, we propose a novel Slow-I-Fast-P (SIFP) neural network model for compressed video action recognition. It consists of a slow I pathway, which receives a sparsely sampled I-frame clip, and a fast P pathway, which receives a densely sampled pseudo-optical-flow clip. An unsupervised estimation method and a new loss function are designed to generate pseudo optical flows directly from compressed videos, eliminating the dependence on traditional optical flows computed from raw videos. The model is trained end to end. The proposed method is evaluated on the challenging HMDB51 and UCF101 datasets, and extensive comparisons and ablation studies demonstrate its effectiveness.
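The two-pathway sampling scheme can be sketched in a few lines: in an MPEG-style group of pictures (GOP), the first frame is an I-frame and the remaining frames are P-frames, so the slow pathway draws a sparse clip from the I-frame positions while the fast pathway draws a dense clip from the P-frame positions. The GOP size and clip lengths below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the Slow-I / Fast-P sampling idea. In an MPEG-style GOP,
# frame 0 is the I-frame and the rest are P-frames. The slow pathway
# takes a sparse clip of I-frames; the fast pathway takes a dense clip
# of P-frame data (motion vectors / pseudo optical flow).
# gop_size and clip lengths are illustrative assumptions.

def gop_frame_indices(num_frames, gop_size=12):
    """Split frame indices into I-frame and P-frame positions by GOP offset."""
    i_frames = [i for i in range(num_frames) if i % gop_size == 0]
    p_frames = [i for i in range(num_frames) if i % gop_size != 0]
    return i_frames, p_frames

def sample_clip(indices, clip_len):
    """Uniformly sample clip_len entries from a list of frame indices."""
    if len(indices) <= clip_len:
        return list(indices)
    step = len(indices) / clip_len
    return [indices[int(k * step)] for k in range(clip_len)]

i_frames, p_frames = gop_frame_indices(120, gop_size=12)
slow_clip = sample_clip(i_frames, clip_len=4)    # sparse I-frame clip
fast_clip = sample_clip(p_frames, clip_len=32)   # dense P-frame clip
```

Each clip would then be fed to its own pathway of the network; the point of the split is that appearance (I-frames) changes slowly and can be sampled sparsely, while motion (P-frame data) must be sampled densely.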
