论文信息 - Space-Time Correspondence as a Contrastive Random Walk

Space-Time Correspondence as a Contrastive Random Walk

This paper proposes a simple self-supervised approach for learning representations for visual correspondence from raw video. We cast correspondence as link prediction in a space-time graph constructed from a video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a node embedding in which pairwise similarity defines transition probabilities of a random walk. Prediction of long-range correspondence is efficiently computed as a walk along this graph. The embedding learns to guide the walk by placing high probability along paths of correspondence. Targets are formed without supervision, by cycle-consistency: we train the embedding to maximize the likelihood of returning to the initial node when walking along a graph constructed from a `palindrome' of frames. We demonstrate that the approach allows for learning representations from large unlabeled video. Despite its simplicity, the method outperforms the self-supervised state-of-the-art on a variety of label propagation tasks involving objects, semantic parts, and pose. Moreover, we show that self-supervised adaptation at test-time and edge dropout improve transfer for object-level correspondence.

Allan Jabri | Andrew Owens | Alexei A. Efros

[1] Martial Hebert,et al. Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[2] Charless C. Fowlkes,et al. Globally-optimal greedy algorithms for tracking a variable number of objects , 2011, CVPR 2011.

[3] Trevor Darrell,et al. Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[5] Alexei A. Efros,et al. Learning Dense Correspondence via 3D-Guided Cycle Consistency , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Razvan Pascanu,et al. Relational inductive biases, deep learning, and graph networks , 2018, ArXiv.

[7] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[8] Nitish Srivastava. Unsupervised Learning of Visual Representations using Videos , 2015 .

[9] David A. Forsyth,et al. Computer Vision - A Modern Approach, Second Edition , 2011 .

[10] Andrew Zisserman,et al. Learning and Using the Arrow of Time , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11] Luc Van Gool,et al. The 2017 DAVIS Challenge on Video Object Segmentation , 2017, ArXiv.

[12] Li Fei-Fei,et al. Unsupervised Learning of Long-Term Motion Dynamics for Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Michael F. Cohen,et al. Real-time hyperlapse creation via optimal frame selection , 2015, ACM Trans. Graph..

[14] Ali Razavi,et al. Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.

[15] Jürgen Schmidhuber,et al. Neural Expectation Maximization , 2017, NIPS.

[16] Jure Leskovec,et al. node2vec: Scalable Feature Learning for Networks , 2016, KDD.

[17] Armand Joulin,et al. Unsupervised Learning by Predicting Noise , 2017, ICML.

[18] K.-K. Maninis,et al. Video Object Segmentation without Temporal Information , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19] Luc Van Gool,et al. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Ali Farhadi,et al. Watching the World Go By: Representation Learning from Unlabeled Videos , 2020, ArXiv.

[21] Leon A. Gatys,et al. A Neural Algorithm of Artistic Style , 2015, ArXiv.

[22] Phillip Isola,et al. Contrastive Multiview Coding , 2019, ECCV.

[23] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[24] Shu Kong,et al. Multigrid Predictive Filter Flow for Unsupervised Learning on Videos , 2019, ArXiv.

[25] Klaus Greff,et al. Multi-Object Representation Learning with Iterative Variational Inference , 2019, ICML.

[26] Jonathan Tompson,et al. Unsupervised Learning of Spatiotemporally Coherent Metrics , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[27] Richard Sinkhorn,et al. Concerning nonnegative matrices and doubly stochastic matrices , 1967 .

[28] David A. Forsyth,et al. Strike a pose: tracking people by finding stylized poses , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[29] Andrew Zisserman,et al. Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[30] Tomasz Malisiewicz,et al. SuperGlue: Learning Feature Matching With Graph Neural Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Kaiming He,et al. Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32] Bastian Leibe,et al. PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation , 2018, ACCV.

[33] Afshin Dehghan,et al. GMCP-Tracker: Global Multi-object Tracking Using Generalized Minimum Clique Graphs , 2012, ECCV.

[34] Virginia R. de Sa,et al. Learning Classification with Unlabeled Data , 1993, NIPS.

[35] Alexei A. Efros,et al. Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36] M. Wertheimer. Laws of organization in perceptual forms. , 1938 .

[37] Elise van der Pol,et al. Contrastive Learning of Structured World Models , 2020, ICLR.

[38] R Devon Hjelm,et al. Learning Representations by Maximizing Mutual Information Across Views , 2019, NeurIPS.

[39] Nitish Srivastava,et al. Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[40] Laura Leal-Taix'e,et al. Learning a Neural Solver for Multiple Object Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Stella X. Yu,et al. Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.

[43] Pierre Sermanet,et al. Online Object Representations with Contrastive Learning , 2019, ArXiv.

[44] Zihang Lai,et al. Self-supervised Learning for Video Correspondence Flow , 2019, ArXiv.

[45] Mingzhe Wang,et al. LINE: Large-scale Information Network Embedding , 2015, WWW.

[46] Alexei A. Efros,et al. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts , 2019, ICML.

[47] Jonathan Tompson,et al. Temporal Cycle-Consistency Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49] Michal Irani,et al. "Zero-Shot" Super-Resolution Using Deep Internal Learning , 2017, CVPR.

[50] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Deva Ramanan,et al. Online Model Distillation for Efficient Video Inference , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[52] Aggelos K. Katsaggelos,et al. Efficient Video Object Segmentation via Network Modulation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54] Yann LeCun,et al. Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[55] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[56] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[57] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.

[58] Jure Leskovec,et al. Inductive Representation Learning on Large Graphs , 2017, NIPS.

[59] Jitendra Malik,et al. Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[60] Michael Tschannen,et al. Self-Supervised Learning of Video-Induced Visual Invariances , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[61] Geoffrey E. Hinton,et al. Self-organizing neural network that discovers surfaces in random-dot stereograms , 1992, Nature.

[62] Wei Liu,et al. Unsupervised Deep Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[63] Hongyuan Zha,et al. Gromov-Wasserstein Learning for Graph Matching and Node Embedding , 2019, ICML.

[64] Abhinav Gupta,et al. Transitive Invariance for Self-Supervised Visual Representation Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[65] Trevor Darrell,et al. Learning Features by Watching Objects Move , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[66] Steven M. Seitz,et al. Filter flow , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[67] Arnold W. M. Smeulders,et al. Long-term Tracking in the Wild: A Benchmark , 2018, ECCV.

[68] Serge J. Belongie,et al. What went where , 2003, CVPR 2003.

[69] Qiang Wang,et al. Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Fei-Fei Li,et al. Efficient Image and Video Co-localization with Frank-Wolfe Algorithm , 2014, ECCV.

[71] Erika Lu,et al. MAST: A Memory-Augmented Self-Supervised Tracker , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[72] Ning Xu,et al. Video Object Segmentation Using Space-Time Memory Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[73] Gabriel Peyré,et al. Learning Generative Models with Sinkhorn Divergences , 2017, AISTATS.

[74] Ersin Yumer,et al. Self-supervised Learning of Motion Capture , 2017, NIPS.

[75] Pietro Liò,et al. Graph Attention Networks , 2017, ICLR.

[76] Yann LeCun,et al. Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[77] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[78] Thomas Brox,et al. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[79] Sergey Levine,et al. Time-Contrastive Networks: Self-Supervised Learning from Video , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[80] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[81] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[82] E. Adelson,et al. Analyzing gait with spatiotemporal surfaces , 1994, Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects.

[83] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[84] James M. Rehg,et al. Unsupervised Learning of Edges , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[85] Jitendra Malik,et al. Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[86] Yoshua Bengio,et al. Learning deep representations by mutual information estimation and maximization , 2018, ICLR.

[87] Ramakant Nevatia,et al. Global data association for multi-object tracking using network flows , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[88] Liang Lin,et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing , 2018, ACM Multimedia.

[89] Edward H. Adelson,et al. Learning visual groups from co-occurrences in space and time , 2015, ArXiv.

[90] Gabriel Kreiman,et al. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning , 2016, ICLR.

[91] Pascal Fua,et al. Ieee Transactions on Pattern Analysis and Machine Intelligence 1 Multiple Object Tracking Using K-shortest Paths Optimization , 2022 .

[92] Bastian Leibe,et al. Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[93] Andreas Geiger,et al. MOTS: Multi-Object Tracking and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[94] Gareth Funka-Lea,et al. Graph Cuts and Efficient N-D Image Segmentation , 2006, International Journal of Computer Vision.

[95] Kristen Grauman,et al. Object-Centric Representation Learning from Unlabeled Videos , 2016, ACCV.

[96] Lorenzo Torresani,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.

[97] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[98] Lihi Zelnik-Manor,et al. Event-based analysis of video , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[99] Kristen Grauman,et al. Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[100] Gabriel Peyré,et al. Gromov-Wasserstein Averaging of Kernel and Distance Matrices , 2016, ICML.

[101] Kaiming He,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[102] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.

[103] Ingmar Posner,et al. GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations , 2019, ICLR.

[104] Jure Leskovec,et al. Supervised random walks: predicting and recommending links in social networks , 2010, WSDM '11.

[105] Alexei A. Efros,et al. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[106] R. Zemel,et al. Neural Relational Inference for Interacting Systems , 2018, ICML.

[107] Jure Leskovec,et al. Representation Learning on Graphs: Methods and Applications , 2017, IEEE Data Eng. Bull..

[108] Sergio Guadarrama,et al. Tracking Emerges by Colorizing Videos , 2018, ECCV.

[109] Luc Van Gool,et al. One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[110] Jianbo Shi,et al. Learning Segmentation by Random Walks , 2000, NIPS.

[111] Max Welling,et al. Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[112] Marco Cuturi,et al. Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[113] Takeo Kanade,et al. An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[114] Jian Sun,et al. Convolutional neural networks at constrained time cost , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[115] Paolo Favaro,et al. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[116] Xueting Li,et al. Joint-task Self-supervised Learning for Temporal Correspondence , 2019, NeurIPS.

[117] Jitendra Malik,et al. Motion segmentation and tracking using normalized cuts , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[118] Jitendra Malik,et al. Spectral grouping using the Nystrom method , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[119] Jason J. Corso,et al. Temporally consistent multi-class video-object segmentation with the Video Graph-Shifts algorithm , 2011, 2011 IEEE Workshop on Applications of Computer Vision (WACV).