RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network

In this paper, we study the compositional learning of images and texts for image retrieval. The query is given in the form of an image and text that describes the desired modifications to the image; the goal is to retrieve the target image that satisfies the given modifications and resembles the query by composing information in both the text and image modalities. To accomplish this task, we propose a simple new architecture using skip connections that can effectively encode the errors between the source and target images in the latent space. Furthermore, we introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods. We find that the combination consistently improves the performance in a plug-and-play manner. We perform thorough and exhaustive experiments on several widely used datasets, and achieve state-of-theart scores on the task with our model. To ensure fairness in comparison, we suggest a strict standard for the evaluation because a small difference in the training conditions can significantly affect the final performance. We release our implementation, including that of all the compared methods, for reproducibility.

[1]  Byoung-Tak Zhang,et al.  Multimodal Residual Learning for Visual QA , 2016, NIPS.

[2]  Matthieu Cord,et al.  BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection , 2019, AAAI.

[3]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[4]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[5]  Yifan Zhang,et al.  Skeleton-Based Action Recognition With Shift Graph Convolutional Network , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Bo Zhao,et al.  Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Minchul Shin,et al.  Semi-supervised Feature-Level Attribute Manipulation for Fashion Image Retrieval , 2019, BMVC.

[8]  Yash Goyal,et al.  Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Bohyung Han,et al.  Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Martin Kleinsteuber,et al.  Compositional Learning of Image-Text Query for Image Retrieval , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[11]  Serge J. Belongie,et al.  Neural Naturalist: Generating Fine-Grained Image Comparisons , 2019, EMNLP/IJCNLP.

[12]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[13]  Yupeng Gao,et al.  Fashion IQ: A New Dataset towards Retrieving Images by Natural Language Feedback , 2019 .

[14]  Zheng Liu,et al.  OD-GCN: Object Detection Boosted by Knowledge GCN , 2019, 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[15]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[16]  Jian Sun,et al.  View-GCN: View-Based Graph Convolutional Network for 3D Shape Analysis , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Minchul Shin,et al.  Fashion-IQ 2020 Challenge 2nd Place Team's Solution , 2020, ArXiv.

[18]  Li Fei-Fei,et al.  Composing Text and Image for Image Retrieval - an Empirical Odyssey , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Bohyung Han,et al.  CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Stephen Gould,et al.  Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Maksims Volkovs,et al.  Guided Similarity Separation for Image Retrieval , 2019, NeurIPS.

[22]  Loris Bazzani,et al.  Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval , 2020, ECCV.

[23]  Jo Yew Tham,et al.  Learning Attribute Representations with Localization for Flexible Fashion Search , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[25]  Gunhee Kim,et al.  CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data , 2020, ArXiv.

[26]  Ayush Chopra,et al.  TRACE: Transform Aggregate and Compose Visiolinguistic Representations for Image Search with Text Feedback , 2020, ArXiv.

[27]  Matthieu Cord,et al.  MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Zhou Yu,et al.  Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[29]  Tianlong Chen,et al.  L2-GCN: Layer-Wise and Learned Efficient Training of Graph Convolutional Networks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Danny Z. Chen,et al.  A Hierarchical Graph Network for 3D Object Detection on Point Clouds , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Shaogang Gong,et al.  Image Search With Text Feedback by Visiolinguistic Attention Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Xiu-Shen Wei,et al.  Multi-Label Image Recognition With Graph Convolutional Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Gunhee Kim,et al.  Cycled Compositional Learning between Images and Text , 2021, ArXiv.

[34]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[35]  Jasper Snoek,et al.  Practical Bayesian Optimization of Machine Learning Algorithms , 2012, NIPS.

[36]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Rogério Schmidt Feris,et al.  Dialog-based Interactive Image Retrieval , 2018, NeurIPS.

[38]  Yang Zhang,et al.  Modality-Agnostic Attention Fusion for visual search with text feedback , 2020, ArXiv.

[39]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[40]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .