Structured Visual Search via Composition-aware Learning

This paper studies visual search using structured queries. The structure takes the form of a 2D composition that encodes the position and the category of each object. Transforming the position or the category of the objects induces a continuous-valued relationship between visual compositions. This relationship carries useful information that previous techniques do not leverage. Our goal in this work is to exploit these continuous relationships through the notion of symmetry in equivariance. The model output is trained to change symmetrically with respect to the input transformations, which yields a feature space that is sensitive to changes in composition. The result is an efficient search technique, since our approach learns from less data and uses a smaller feature space. Experiments on two large-scale benchmarks, MS-COCO and HICO-DET, demonstrate that our approach yields a considerable performance gain over competing techniques.
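The idea of training features to change in step with composition transformations can be illustrated with a small sketch. This is not the paper's implementation. The grid rasterization, the intersection-over-union similarity, the squared-error loss, and all function names below are assumptions chosen only to make the notion of a continuous-valued relationship between compositions concrete.

```python
from math import sqrt


def rasterize(objects, grid=8):
    """Encode a composition as the set of occupied (category, row, col) cells.

    Each object is (category, x0, y0, x1, y1) with coordinates in [0, 1].
    The grid encoding is an illustrative assumption, not the paper's format.
    """
    cells = set()
    for cat, x0, y0, x1, y1 in objects:
        for r in range(grid):
            for c in range(grid):
                # A cell is occupied when its center lies inside the box.
                cx, cy = (c + 0.5) / grid, (r + 0.5) / grid
                if x0 <= cx < x1 and y0 <= cy < y1:
                    cells.add((cat, r, c))
    return cells


def composition_similarity(a, b):
    """Continuous similarity in [0, 1]: intersection over union of cells."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def feature_distance(f, g):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(f, g)))


def composition_aware_loss(f, g, similarity):
    """Hypothetical loss: feature distance should track composition change.

    Penalizes the gap between the feature distance and the composition
    dissimilarity (1 - similarity), so features move as compositions move.
    """
    return (feature_distance(f, g) - (1.0 - similarity)) ** 2


# Same category, box shifted right by half its width.
query = rasterize([(0, 0.0, 0.0, 0.5, 0.5)])
shifted = rasterize([(0, 0.25, 0.0, 0.75, 0.5)])
# Same position, different category.
recategorized = rasterize([(1, 0.0, 0.0, 0.5, 0.5)])

sim_shift = composition_similarity(query, shifted)
sim_category = composition_similarity(query, recategorized)
```

In this sketch the shifted composition keeps a similarity of 1/3 to the query, while the recategorized one drops to 0. A binary match/no-match label would treat both as equally wrong, whereas the continuous value separates them, which is the kind of signal the abstract describes.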
