Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-annotated labels. While previous efforts tackle the problem by designing novel self-supervised tasks on video data, the learned features are merely frame-based, which makes them poorly suited to the many video analysis tasks where spatio-temporal features prevail. In this paper we propose a novel self-supervised approach to learning spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along the spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (the fastest-moving region and its dominant motion direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both the spatial and temporal domains. Unlike prior puzzle-based pretext tasks that can be hard even for humans to solve, the proposed tasks are consistent with inherent human visual habits and are therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas.
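To make the pretext task concrete, the sketch below shows one plausible way to derive a motion-statistics label (the fastest-moving spatial block and its dominant flow direction) from a short clip. The 4x4 block grid, the 8 orientation bins, Farneback optical flow, and the helper name `motion_statistics_labels` are illustrative assumptions rather than the paper's exact recipe; the released code at the URL above is the authoritative reference.

```python
# A minimal, hypothetical sketch of computing a motion-statistics label for a clip.
# Assumptions (not prescribed by the abstract): Farneback optical flow, a 4x4 block
# grid, 8 orientation bins, and frames pre-resized so H and W are divisible by the
# grid (e.g., 112x112 C3D inputs with grid=4).
import cv2
import numpy as np

def motion_statistics_labels(frames, grid=4, n_bins=8):
    """frames: list of HxW uint8 grayscale frames from one clip."""
    h, w = frames[0].shape
    bh, bw = h // grid, w // grid
    mag_sum = np.zeros((h, w), dtype=np.float32)
    ang_hist = np.zeros((grid * grid, n_bins), dtype=np.float32)

    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # ang in radians
        mag_sum += mag
        bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        # accumulate a magnitude-weighted orientation histogram per spatial block
        for gy in range(grid):
            for gx in range(grid):
                ys = slice(gy * bh, (gy + 1) * bh)
                xs = slice(gx * bw, (gx + 1) * bw)
                ang_hist[gy * grid + gx] += np.bincount(
                    bins[ys, xs].ravel(),
                    weights=mag[ys, xs].ravel(),
                    minlength=n_bins)

    # average motion magnitude per block, accumulated over the whole clip
    block_mag = mag_sum.reshape(grid, bh, grid, bw).mean(axis=(1, 3))
    fastest_block = int(block_mag.argmax())               # largest-motion region
    dominant_dir = int(ang_hist[fastest_block].argmax())  # its dominant direction
    return fastest_block, dominant_dir
```

In the actual method such statistics supervise a 3D CNN backbone (C3D in our experiments); the appearance statistics (dominant color and spatio-temporal color diversity per block) can be derived analogously from the raw RGB frames.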
