Unsupervised Representation Learning by Sorting Sequences

We present an unsupervised representation learning approach using videos without semantic labels. We leverage the temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based sorting algorithms, we propose to extract features from all frame pairs and aggregate them to predict the correct order. As sorting shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representation. We validate the effectiveness of the learned representation using our method as pre-training on high-level recognition problems. The experimental results show that our method compares favorably against state-of-the-art methods on action recognition, image classification, and object detection tasks.

[1]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[2]  Trevor Darrell,et al.  Data-dependent Initializations of Convolutional Neural Networks , 2015, ICLR.

[3]  David J. Field,et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[4]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[5]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[6]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Gregory Shakhnarovich,et al.  Learning Representations for Automatic Colorization , 2016, ECCV.

[8]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[9]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[10]  Narendra Ahuja,et al.  Unsupervised Visual Representation Learning by Graph-Based Consistent Constraints , 2016, ECCV.

[11]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[12]  Hongdong Li,et al.  Learning Image Matching by Simply Watching Video , 2016, ECCV.

[13]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[14]  Kristen Grauman,et al.  Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.

[16]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[17]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[18]  Jean Ponce,et al.  Learning Discriminative Part Detectors for Image Classification and Cosegmentation , 2013, 2013 IEEE International Conference on Computer Vision.

[19]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[21]  Kristen Grauman,et al.  Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Edward H. Adelson,et al.  Learning visual groups from co-occurrences in space and time , 2015, ArXiv.

[23]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[24]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[26]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[27]  Yoshua Bengio,et al.  Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[28]  Hossein Mobahi,et al.  Deep learning from temporal coherence in video , 2009, ICML '09.

[29]  Gabriel Kreiman,et al.  Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning , 2016, ICLR.

[30]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[31]  Abhinav Gupta,et al.  Pose from Action: Unsupervised Learning of Pose Features based on Motion , 2016, ArXiv.

[32]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33]  Ali Farhadi,et al.  Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[36]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[37]  Rajat Raina,et al.  Self-taught learning: transfer learning from unlabeled data , 2007, ICML '07.

[38]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[39]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[40]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[41]  Andrew Owens,et al.  Ambient Sound Provides Supervision for Visual Learning , 2016, ECCV.

[42]  Alexei A. Efros,et al.  Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.