Paper Information - Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), with two significant differentiators from prior art: (i) we tackle all variants of ZS-SBIR (inter-category, intra-category, and cross-dataset) with just one network ("everything"), and (ii) we want to understand how this sketch-photo matching operates ("explainable"). Our key innovation lies in the realization that such a cross-modal matching problem can be reduced to comparisons of groups of key local patches, akin to the seasoned "bag-of-words" paradigm. With this change alone, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network with three novel components: (i) a self-attention module with a learnable tokenizer to produce visual tokens that correspond to the most informative local regions, (ii) a cross-attention module to compute local correspondences between the visual tokens across the two modalities, and finally (iii) a kernel-based relation network to assemble local putative matches and produce an overall similarity metric for a sketch-photo pair. Experiments show that ours indeed delivers superior performance across all ZS-SBIR settings. The all-important explainability goal is elegantly achieved by visualizing cross-modal token correspondences and, for the first time, via sketch-to-photo synthesis by universal replacement of all matched photo patches. Code and model are available at https://github.com/buptLinfy/ZSE-SBIR.
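The patch-matching idea in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the token vectors are hand-made stand-ins for the output of the learnable tokenizer, all function names (`cross_attention`, `match_tokens`, `pair_similarity`) are hypothetical, and the learned kernel-based relation network is replaced here by a plain average of cosine similarities over matched tokens.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys):
    """Scaled dot-product attention weights: row i is the distribution of
    query token i over all key tokens."""
    scale = math.sqrt(len(queries[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]

def match_tokens(sketch_tokens, photo_tokens):
    """For each sketch token, index of the photo token with the highest
    attention weight (a putative local correspondence)."""
    attn = cross_attention(sketch_tokens, photo_tokens)
    return [max(range(len(row)), key=row.__getitem__) for row in attn]

def pair_similarity(sketch_tokens, photo_tokens):
    """Assumed stand-in for the relation network: mean cosine similarity
    between each sketch token and its matched photo token."""
    matches = match_tokens(sketch_tokens, photo_tokens)
    sims = [cosine(s, photo_tokens[j]) for s, j in zip(sketch_tokens, matches)]
    return sum(sims) / len(sims)

# Toy data (assumed): two 2-D tokens per image.
sketch = [[1.0, 0.0], [0.0, 1.0]]
photo_a = [[0.0, 2.0], [2.0, 0.0]]    # same directions, permuted
photo_b = [[-1.0, 0.0], [0.0, -1.0]]  # opposite directions

if __name__ == "__main__":
    print("matches for photo_a:", match_tokens(sketch, photo_a))
    print("similarity to photo_a:", pair_similarity(sketch, photo_a))
    print("similarity to photo_b:", pair_similarity(sketch, photo_b))
```

The match indices returned by `match_tokens` are the kind of token correspondence the abstract proposes to visualize; replacing each sketch patch with its matched photo patch is the idea behind the sketch-to-photo synthesis it mentions.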
