Paper Information - Cross and Learn: Cross-Modal Self-Supervision

Cross and Learn: Cross-Modal Self-Supervision

In this paper, we present a self-supervised method for representation learning that uses two different modalities. Based on the observation that cross-modal information carries high semantic meaning, we propose a method to exploit this signal effectively. Our approach uses video data, since it is available at large scale and provides easily accessible modalities in the form of RGB and optical flow. We demonstrate state-of-the-art performance on highly contested action recognition datasets in the context of self-supervised learning. We also show that our feature representation transfers to other tasks, and we conduct extensive ablation studies to validate our core contributions.
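
The abstract does not spell out the training objective, so the following is only an illustrative sketch of how a cross-modal signal between RGB and optical flow might be turned into a loss. It assumes, hypothetically, that each clip yields one RGB feature vector and one flow feature vector, that features of the same clip should agree across modalities, and that features of different clips should stay apart within a modality. The function names and the use of cosine distance are assumptions made for illustration, not details taken from the paper.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero feature vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cross_modal_loss(rgb_feats, flow_feats):
    """Hypothetical objective over paired per-clip features.

    rgb_feats[i] and flow_feats[i] are assumed to come from the same clip.
    The loss is low when paired features agree across modalities and
    features of different clips differ within each modality.
    """
    n = len(rgb_feats)
    # Agreement term: same clip, different modality (to be minimized).
    agree = sum(cosine_distance(rgb_feats[i], flow_feats[i]) for i in range(n)) / n
    # Diversity term: different clips, same modality (to be maximized).
    # Each clip is compared with its predecessor; index -1 wraps to the last clip.
    diverse = sum(
        0.5 * (cosine_distance(rgb_feats[i], rgb_feats[i - 1])
               + cosine_distance(flow_feats[i], flow_feats[i - 1]))
        for i in range(n)
    ) / n
    return agree - diverse


# Two clips whose RGB and flow features agree, but differ between clips.
aligned = cross_modal_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Two clips whose features have collapsed to the same vector.
collapsed = cross_modal_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

In this toy setup, the aligned-but-diverse features score lower than the collapsed ones. The diversity term is what prevents a trivial solution in which every input maps to the same vector.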
