Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

This paper studies the problem of zero-short sketch-based image retrieval (ZS-SBIR), however with two significant differentiators to prior art (i) we tackle all variants (inter-category, intra-category, and cross datasets) of ZS-SBIR with just one network (``everything''), and (ii) we would really like to understand how this sketch-photo matching operates (``explainable''). Our key innovation lies with the realization that such a cross-modal matching problem could be reduced to comparisons of groups of key local patches -- akin to the seasoned ``bag-of-words'' paradigm. Just with this change, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network, with three novel components (i) a self-attention module with a learnable tokenizer to produce visual tokens that correspond to the most informative local regions, (ii) a cross-attention module to compute local correspondences between the visual tokens across two modalities, and finally (iii) a kernel-based relation network to assemble local putative matches and produce an overall similarity metric for a sketch-photo pair. Experiments show ours indeed delivers superior performances across all ZS-SBIR settings. The all important explainable goal is elegantly achieved by visualizing cross-modal token correspondences, and for the first time, via sketch to photo synthesis by universal replacement of all matched photo patches. Code and model are available at \url{https://github.com/buptLinfy/ZSE-SBIR}.

[1]  Timothy M. Hospedales,et al.  Deep Learning for Free-Hand Sketch: A Survey , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Heng Tao Shen,et al.  TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval , 2022, AAAI.

[3]  Shijian Lu,et al.  Marginal Contrastive Correspondence for Guided Image Generation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Pinaki Nath Chowdhury,et al.  Sketch3T: Test-Time Training for Zero-Shot SBIR , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  P. Xie,et al.  Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations , 2022, ArXiv.

[6]  Zeynep Akata,et al.  BDA-SketRet: Bi-Level Domain Adaptation for Zero-Shot SBIR , 2022, Neurocomputing.

[7]  Sridha Sridharan,et al.  An Efficient Framework for Zero-Shot Sketch-Based Image Retrieval , 2021, Pattern Recognit..

[8]  William Grisaitis,et al.  The Animation Transformer: Visual Correspondence via Segment Matching , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Aming Wu,et al.  Domain-Smoothing Network for Zero-Shot Sketch-Based Image Retrieval , 2021, IJCAI.

[10]  Julien Mairal,et al.  Emerging Properties in Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[11]  Tao Xiang,et al.  StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Quanfu Fan,et al.  CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[14]  Soma Biswas,et al.  StyleGuide: Zero-Shot Sketch-Based Image Retrieval Using Style-Guided Image Generation , 2021, IEEE Transactions on Multimedia.

[15]  Dacheng Tao,et al.  Progressive Cross-Modal Semantic Network for Zero-Shot Sketch-Based Image Retrieval , 2020, IEEE Transactions on Image Processing.

[16]  Ankush Gupta,et al.  CrossTransformers: spatially-aware few-shot transfer , 2020, NeurIPS.

[17]  Lu Yuan,et al.  Cross-Domain Correspondence Learning for Exemplar-Based Image Translation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Rui Feng,et al.  Zero-Shot Sketch-Based Image Retrieval via Graph Convolution Network , 2020, AAAI.

[19]  Timothy M. Hospedales,et al.  Sketch Less for More: On-the-Fly Fine-Grained Sketch-Based Image Retrieval , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Zeynep Akata,et al.  Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-Based Image Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Tao Xiang,et al.  Generalising Fine-Grained Sketch-Based Image Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Josep Lladós,et al.  Doodle to Search: Practical Zero-Shot Sketch-Based Image Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Qing Liu,et al.  Semantic-Aware Knowledge Preservation for Zero-Shot Sketch-Based Image Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Jean Ponce,et al.  SFNet: Learning Object-Aware Semantic Correspondence , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Wojciech Samek,et al.  Explainable AI: Interpreting, Explaining and Visualizing Deep Learning , 2019, Explainable AI.

[26]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[27]  Anurag Mittal,et al.  A Zero-Shot Framework for Sketch-based Image Retrieval , 2018, ECCV.

[28]  Pascal Fua,et al.  LF-Net: Learning Local Features from Images , 2018, NeurIPS.

[29]  Ling Shao,et al.  Zero-Shot Sketch-Image Hashing , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Jun Guo,et al.  SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Tao Xiang,et al.  Learning to Compare: Relation Network for Few-Shot Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Cordelia Schmid,et al.  Proposal Flow: Semantic Correspondences from Object Proposals , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Tao Xiang,et al.  Deep Spatial-Semantic Attention for Fine-Grained Sketch-Based Image Retrieval , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[35]  Jean Ponce,et al.  SCNet: Learning Semantic Correspondence , 2017, ICCV.

[36]  Shaogang Gong,et al.  Semantic Autoencoder for Zero-Shot Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Ling Shao,et al.  Deep Sketch Hashing: Fast Free-Hand Sketch-Based Image Retrieval , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Seungryong Kim,et al.  FCSS: Fully Convolutional Self-Similarity for Dense Semantic Correspondence , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Feng Liu,et al.  Sketch Me That Shoe , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Raquel Urtasun,et al.  Understanding the Effective Receptive Field in Deep Convolutional Neural Networks , 2016, NIPS.

[42]  Frédéric Jurie,et al.  Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classiffication , 2016, ECCV.

[43]  Xiaochun Cao,et al.  SketchNet: Sketch Classification with Web Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Yang Yang,et al.  Zero-Shot Hashing via Transferring Supervised Knowledge , 2016, ACM Multimedia.

[45]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[46]  Wei-Lun Chao,et al.  Synthesized Classifiers for Zero-Shot Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Venkatesh Saligrama,et al.  Zero-Shot Learning via Semantic Similarity Embedding , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[49]  Jose M. Saavedra,et al.  Sketch based Image Retrieval using Learned KeyShapes (LKS) , 2015, BMVC.

[50]  Trevor Darrell,et al.  Do Convnets Learn Correspondence? , 2014, NIPS.

[51]  Jose M. Saavedra,et al.  Sketch based image retrieval using a soft computation of the histogram of edge local orientations (S-HELO) , 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[52]  Anurag Mittal,et al.  Similarity-Invariant Sketch-Based Image Retrieval in Large Databases , 2014, ECCV.

[53]  Tatsuya Harada,et al.  Image Reconstruction from Bag-of-Visual-Words , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Rui Hu,et al.  A performance evaluation of gradient field HOG descriptor for sketch based image retrieval , 2013, Comput. Vis. Image Underst..

[55]  Marc Alexa,et al.  How do humans sketch objects? , 2012, ACM Trans. Graph..

[56]  Liqing Zhang,et al.  Edgel index for large-scale sketch-based image search , 2011, CVPR 2011.

[57]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Olivier Stasse,et al.  MonoSLAM: Real-Time Single Camera SLAM , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[60]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[61]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[62]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.