Paper Information - Joint-task Self-supervised Learning for Temporal Correspondence

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between the two tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
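To make the idea of an inter-frame affinity matrix concrete, the following is a minimal NumPy sketch of one common way to build such a matrix and use it to propagate per-pixel values from one frame to the next. It is an illustration under stated assumptions, not the authors' implementation: the function names `affinity_matrix` and `propagate`, the `temperature` parameter, the dot-product similarity, and the column-wise softmax are all choices made here for the example.

```python
import numpy as np


def affinity_matrix(feat_ref, feat_tgt, temperature=1.0):
    """Soft correspondence between two frames (illustrative assumption).

    feat_ref: (N_ref, C) features of the reference frame, one row per pixel.
    feat_tgt: (N_tgt, C) features of the target frame, one row per pixel.
    Returns an (N_ref, N_tgt) matrix whose columns each sum to 1, so
    column j is a distribution over reference pixels for target pixel j.
    """
    logits = feat_ref @ feat_tgt.T / temperature
    # Subtract the per-column maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def propagate(affinity, values_ref):
    """Carry per-pixel values (labels, colors, ...) to the target frame.

    affinity:   (N_ref, N_tgt) matrix from affinity_matrix.
    values_ref: (N_ref, D) values attached to reference pixels.
    Returns (N_tgt, D): each target pixel receives a weighted average
    of the reference values.
    """
    return affinity.T @ values_ref


# Toy usage: the "target frame" is a shuffled copy of the reference frame.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit-length rows
order = np.array([3, 0, 4, 1, 2])

A = affinity_matrix(features, features[order], temperature=0.01)
one_hot_labels = np.eye(5)
propagated = propagate(A, one_hot_labels)
recovered = propagated.argmax(axis=1)  # label assigned to each target pixel
```

In this toy setup the features are unit-normalized, so each target pixel is most similar to the reference pixel it was copied from, and `recovered` matches `order`. The same matrix can carry any per-pixel quantity, which is what makes it usable for tasks such as segmentation-mask or keypoint propagation.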
