Image Search with Text Feedback by Deep Hierarchical Attention Mutual Information Maximization

Image retrieval with text feedback is an emerging research topic whose objective is to integrate inputs from multiple modalities into a single query. In this paper, a query consists of a reference image plus text feedback that describes the modifications between this image and the desired image. Existing work on this task mainly focuses on designing new fusion networks to compose the image and the text. Little research, however, addresses the modality gap caused by the inconsistent distributions of features from different modalities, which strongly affects both feature fusion and similarity learning between queries and the desired image. We propose a Distribution-Aligned Text-based Image Retrieval (DATIR) model, which consists of attention mutual information maximization and hierarchical mutual information maximization, to bridge this gap by increasing the nonlinear statistical dependencies between representations of different modalities. More specifically, attention mutual information maximization narrows the gap between the input modalities by maximizing the mutual information between the text representation and its semantically consistent representation, which a difference transformer captures from the reference image and the desired image. Hierarchical mutual information maximization aligns the feature distributions of the image modality and the fusion modality by estimating the mutual information between a single-layer representation in the fusion network and the multi-level representations in the desired image encoder. Extensive experiments on three large-scale benchmark datasets demonstrate that the model bridges the modality gap and achieves state-of-the-art retrieval performance.
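
The abstract does not say which mutual information estimator DATIR uses, so the following is only a minimal sketch of the general idea of maximizing mutual information between paired representations from two modalities. It assumes an InfoNCE-style contrastive lower bound as a stand-in estimator; the function name `info_nce_lower_bound`, the `temperature` parameter, and the feature shapes are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def info_nce_lower_bound(text_feats, image_feats, temperature=0.1):
    """InfoNCE-style lower bound on mutual information (illustrative only).

    Assumption: row i of text_feats and row i of image_feats form a
    positive pair; every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)

    # (N, N) matrix of critic scores between every text/image pair.
    logits = (t @ v.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Bound = log N + mean log-probability assigned to the positive pairs.
    n = logits.shape[0]
    return np.log(n) + np.mean(np.diag(log_probs))


# Usage with synthetic features (hypothetical sizes: 64 pairs, 16 dims).
rng = np.random.default_rng(0)
text = rng.normal(size=(64, 16))
aligned_image = text + 0.01 * rng.normal(size=text.shape)
unrelated_image = rng.normal(size=text.shape)

print("aligned pairs:  ", info_nce_lower_bound(text, aligned_image))
print("unrelated pairs:", info_nce_lower_bound(text, unrelated_image))
```

In a training setting, the negative of such a bound would typically be added to the retrieval loss, so that raising the bound pulls the paired representations of the two modalities toward a shared distribution. The bound can never exceed the logarithm of the batch size, which is one reason contrastive estimators of this kind are usually run with large batches.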
