Deep Semantic Matching with Foreground Detection and Cycle-Consistency

Establishing dense semantic correspondences between object instances remains a challenging problem due to background clutter, significant scale and pose differences, and large intra-class variations. In this paper, we present an end-to-end trainable network for learning semantic correspondences using only matching image pairs without manual keypoint correspondence annotations. To facilitate network training with this weaker form of supervision, we (1) explicitly estimate the foreground regions to suppress the effect of background clutter and (2) develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent. We train the proposed model using the PF-PASCAL dataset and evaluate the performance on the PF-PASCAL, PF-WILLOW, and TSS datasets. Extensive experimental results show that the proposed approach achieves favorably performance compared to the state-of-the-art. The code and model will be available at https://yunchunchen.github.io/WeakMatchNet/.

[1]  Yong Jae Lee,et al.  FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yen-Yu Lin,et al.  Progressive Feature Matching with Alternate Descriptor Selection and Correspondence Enrichment , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Cordelia Schmid,et al.  Proposal Flow: Semantic Correspondences from Object Proposals , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Seungryong Kim,et al.  FCSS: Fully Convolutional Self-Similarity for Dense Semantic Correspondence , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Maneesh Kumar Singh,et al.  DRIT++: Diverse Image-to-Image Translation via Disentangled Representations , 2019, International Journal of Computer Vision.

[6]  Stephen Lin,et al.  DCTM: Discrete-Continuous Transformation Matching for Semantic Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7]  Yu-Chiang Frank Wang,et al.  Cross-Resolution Adversarial Dual Network for Person Re-Identification and Beyond , 2020, ArXiv.

[8]  Josef Sivic,et al.  End-to-End Weakly-Supervised Semantic Alignment , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Narendra Ahuja,et al.  Temporally coherent completion of dynamic video , 2016, ACM Trans. Graph..

[10]  Fan Yang,et al.  Object-Aware Dense Semantic Correspondence , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Bing-Yu Chen,et al.  Co-Segmentation Guided Hough Transform for Robust Feature Matching , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Andrea Vedaldi,et al.  AnchorNet: A Weakly Supervised Network to Learn Geometry-Sensitive Features for Semantic Matching , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Josef Sivic,et al.  Convolutional Neural Network Architecture for Geometric Matching , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  拓海 杉山,et al.  “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[16]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[17]  Yung-Yu Chuang,et al.  Robust image alignment with multiple feature descriptors and matching-guided neighborhoods , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Adrian Hilton,et al.  Semantically Coherent Co-Segmentation and Reconstruction of Dynamic Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[20]  Yu-Chiang Frank Wang,et al.  Learning Resolution-Invariant Deep Representations for Person Re-Identification , 2019, AAAI.

[21]  Yung-Yu Chuang,et al.  Co-attention CNNs for Unsupervised Object Co-segmentation , 2018, IJCAI.

[22]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Vincent Lepetit,et al.  DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[25]  Jean Ponce,et al.  SCNet: Learning Semantic Correspondence , 2017, ICCV.

[26]  Alexei A. Efros,et al.  Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Ce Liu,et al.  Deformable Spatial Pyramid Matching for Fast Dense Correspondences , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Björn Ommer,et al.  Deep Semantic Feature Matching , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Ming-Hsuan Yang,et al.  Show, Match and Segment: Joint Weakly Supervised Learning of Semantic Matching and Object Co-segmentation. , 2020, IEEE transactions on pattern analysis and machine intelligence.

[30]  Ersin Yumer,et al.  Learning Blind Video Temporal Consistency , 2018, ECCV.

[31]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[32]  Alexei A. Efros,et al.  Learning Dense Correspondence via 3D-Guided Cycle Consistency , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[34]  Yu-Chiang Frank Wang,et al.  Recover and Identify: A Generative Dual Model for Cross-Resolution Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Zhi-Gang Zheng,et al.  A region based stereo matching algorithm using cooperative optimization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  David W. Jacobs,et al.  WarpNet: Weakly Supervised Matching for Single-View Reconstruction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Silvio Savarese,et al.  Universal Correspondence Network , 2016, NIPS.

[38]  Bing-Yu Chen,et al.  Matching Images With Multiple Descriptors: An Unsupervised Approach for Locally Adaptive Descriptor Selection , 2015, IEEE Transactions on Image Processing.

[39]  Yoichi Sato,et al.  Joint Recovery of Dense Correspondence and Cosegmentation in Two Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[41]  Richard Szeliski,et al.  A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms , 2001, International Journal of Computer Vision.

[42]  Xiaowei Zhou,et al.  Multi-image Matching via Fast Alternating Minimization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[43]  Nikos Komodakis,et al.  Learning to compare image patches via convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Stefan Roth,et al.  UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss , 2017, AAAI.

[45]  Jia-Bin Huang,et al.  DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency , 2018, ECCV.

[46]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[47]  Yi Yang,et al.  Articulated Human Detection with Flexible Mixtures of Parts , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Ming-Hsuan Yang,et al.  CrDoCo: Pixel-Level Domain Transfer With Cross-Domain Consistency , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).