Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification

Video classification is of great importance and has widespread applications, such as video search and intelligent surveillance. Video naturally contains both static and motion information, which can be represented by frames and optical flow, respectively. Recent approaches generally adopt deep networks to capture static and motion information separately, a strategy that has two main limitations. First, the coexistence relationship between spatial and temporal attention is ignored, although the two should be jointly modeled, following the spatial and temporal evolution of the video, to learn discriminative video features. Second, the strong complementarity between static and motion information is ignored, although the two should be learned collaboratively so that they enhance each other. To address these two limitations, this paper proposes the two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach, which consists of two models. First, in the spatial-temporal attention model, spatial-level attention emphasizes the salient regions in a frame, while temporal-level attention exploits the discriminative frames in a video; the two are mutually enhanced and jointly learn discriminative static and motion features for better classification performance. Second, the static-motion collaborative model not only achieves mutual guidance between static and motion information to enhance feature learning but also adaptively learns the fusion weights of the static and motion streams, thus exploiting the strong complementarity between static and motion information to improve video classification. Experiments on four widely used datasets show that our TCLSTA approach achieves the best performance compared with more than 10 state-of-the-art methods.
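To make the described architecture concrete, below is a minimal, illustrative sketch (in PyTorch) of a two-stream model with spatial-level attention within each frame, temporal-level attention across frames, and adaptively learned fusion weights between the static (RGB) and motion (optical-flow) streams. This is not the authors' implementation: all module names, feature dimensions, and the softmax-based forms of the attention and fusion weights are assumptions made for illustration only.

```python
# Hedged sketch of a two-stream network with spatial-temporal attention and
# adaptive stream fusion. Names, dimensions, and attention forms are assumptions,
# not the TCLSTA reference implementation.
import torch
import torch.nn as nn


class SpatialTemporalAttentionStream(nn.Module):
    """One stream (static or motion): per-frame CNN feature maps -> spatial
    attention within each frame -> temporal attention over frames -> class scores."""

    def __init__(self, feat_dim=512, num_classes=101):
        super().__init__()
        self.spatial_att = nn.Conv2d(feat_dim, 1, kernel_size=1)  # saliency map per frame
        self.temporal_att = nn.Linear(feat_dim, 1)                # importance score per frame
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feat_maps):
        # feat_maps: (batch, frames, channels, H, W) from a CNN backbone.
        b, t, c, h, w = feat_maps.shape
        x = feat_maps.reshape(b * t, c, h, w)
        # Spatial-level attention: emphasize salient regions within each frame.
        s = torch.softmax(self.spatial_att(x).reshape(b * t, -1), dim=1)
        s = s.reshape(b * t, 1, h, w)
        frame_feat = (x * s).sum(dim=(2, 3)).reshape(b, t, c)
        # Temporal-level attention: emphasize discriminative frames in the video.
        a = torch.softmax(self.temporal_att(frame_feat).squeeze(-1), dim=1)
        video_feat = (frame_feat * a.unsqueeze(-1)).sum(dim=1)
        return self.classifier(video_feat)


class TwoStreamFusion(nn.Module):
    """Static (RGB) and motion (optical-flow) streams fused with learned weights."""

    def __init__(self, num_classes=101):
        super().__init__()
        self.static_stream = SpatialTemporalAttentionStream(num_classes=num_classes)
        self.motion_stream = SpatialTemporalAttentionStream(num_classes=num_classes)
        self.fusion_logits = nn.Parameter(torch.zeros(2))  # adaptive stream weights

    def forward(self, rgb_feats, flow_feats):
        w = torch.softmax(self.fusion_logits, dim=0)
        return w[0] * self.static_stream(rgb_feats) + w[1] * self.motion_stream(flow_feats)


if __name__ == "__main__":
    model = TwoStreamFusion(num_classes=101)
    rgb = torch.randn(2, 8, 512, 7, 7)   # hypothetical conv features for 8 frames
    flow = torch.randn(2, 8, 512, 7, 7)  # hypothetical conv features for stacked flow
    print(model(rgb, flow).shape)        # torch.Size([2, 101])
```

In this sketch, the spatial and temporal attention weights are plain softmax distributions and the two streams are fused by a single learned convex combination; the paper's mutual-enhancement and mutual-guidance mechanisms would add cross-stream interactions on top of this skeleton.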
