Learning Spatio-temporal Representation by Channel Aliasing Video Perception

In this paper, we propose a novel pretext task namely Channel Aliasing Video Perception (CAVP) for self-supervised video representation learning. The main idea of our approach is to generate channel aliasing videos, which carry different motion cues simultaneously by assembling distinct channels from different videos. With the generated channel aliasing videos, we propose to recognize the number of different motion flows within a channel aliasing video for perception of discriminative motion cues. As a plug-and-play method, the proposed pretext task can be integrated into a co-training framework with other self-supervised learning methods to further improve the performance. Experimental results on publicly available action recognition benchmarks verify the effectiveness of our method for spatio-temporal representation learning.

[1]  Andrew Zisserman,et al.  Learning and Using the Arrow of Time , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Ke Li,et al.  Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion , 2020, AAAI.

[3]  Sergio Guadarrama,et al.  Tracking Emerges by Colorizing Videos , 2018, ECCV.

[4]  Yueting Zhuang,et al.  Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jianbo Jiao,et al.  Self-supervised Video Representation Learning by Pace Prediction , 2020, ECCV.

[7]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[8]  Wei Liu,et al.  Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Hongyi Zhang,et al.  mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[10]  Heng Wang,et al.  Video Modeling With Correlation Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Aapo Hyvärinen,et al.  Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.

[12]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[13]  Alan Yuille,et al.  Grouped Spatial-Temporal Aggregation for Efficient Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Allan Jabri,et al.  Learning Correspondence From the Cycle-Consistency of Time , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Toshihiko Yamasaki,et al.  Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework , 2020, ACM Multimedia.

[17]  Sergey Levine,et al.  Time-Contrastive Networks: Self-Supervised Learning from Video , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[18]  Runhao Zeng,et al.  Graph Convolutional Networks for Temporal Action Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  In-So Kweon,et al.  Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles , 2018, AAAI.

[20]  Andrew Zisserman,et al.  Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[21]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[22]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Shih-Fu Chang,et al.  Unsupervised Embedding Learning via Invariant and Spreading Instance Feature , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Andrea Vedaldi,et al.  Self-labelling via simultaneous clustering and representation learning , 2020, ICLR.

[26]  Yoshua Bengio,et al.  Learning deep representations by mutual information estimation and maximization , 2018, ICLR.

[27]  Jiebo Luo,et al.  AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations Rather Than Data , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Luc Van Gool,et al.  DynamoNet: Dynamic Action and Motion Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Yu Zhou,et al.  Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Andrew Zisserman,et al.  End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Shanmuganathan Raman,et al.  Attentive Spatio-Temporal Representation Learning for Diving Classification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[33]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[34]  Jianbo Shi,et al.  Learning Discriminative Motion Features Through Detection , 2018, ArXiv.

[35]  Paolo Favaro,et al.  Video Representation Learning by Recognizing Temporal Transformations , 2020, ECCV.

[36]  William T. Freeman,et al.  SpeedNet: Learning the Speediness in Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Weiping Wang,et al.  Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning , 2020, AAAI.

[38]  Runhao Zeng,et al.  RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning , 2020, AAAI.

[39]  Yi Li,et al.  RESOUND: Towards Action Recognition Without Representation Bias , 2018, ECCV.

[40]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[41]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[42]  Ming-Hsuan Yang,et al.  Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[43]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Zhang Zhang,et al.  Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition , 2020, ACM Multimedia.

[45]  Limin Wang,et al.  Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Sergio Escalera,et al.  Gate-Shift Networks for Video Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Andrew Zisserman,et al.  Self-supervised Co-training for Video Representation Learning , 2020, NeurIPS.

[48]  Leonidas J. Guibas,et al.  Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[51]  Yuting Gao,et al.  Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[53]  Cordelia Schmid,et al.  Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[54]  Michael S. Ryoo,et al.  Evolving Losses for Unsupervised Video Representation Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Björn Ommer,et al.  Improving Spatiotemporal Self-Supervision by Deep Reinforcement Learning , 2018, ECCV.

[56]  Thomas Brox,et al.  Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[58]  Stella X. Yu,et al.  Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60]  Longlong Jing,et al.  Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction. , 2018, 1811.11387.

[61]  Xin Li,et al.  Deep Concept-wise Temporal Convolutional Networks for Action Localization , 2019, ACM Multimedia.

[62]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[63]  Hailin Jin,et al.  Steering Self-Supervised Feature Learning Beyond Local Pixel Statistics , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).