Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

We study the problem of grounding distributional representations of text in the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitations of current frameworks and image-text datasets (e.g., MS-COCO) both quantitatively and qualitatively. The large gap between the number of possible compositions of real-world semantics and the size of available parallel data largely prevents models from establishing a strong link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning dataset with textual contrastive adversarial samples. These samples are synthesized using human language priors and the WordNet knowledge base, and force the model to ground the learned embeddings in concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending against known types of adversarial attacks. Code is available at https://github.com/ExplorerFreda/VSE-C.
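As a rough illustration of the augmentation idea, the sketch below builds one kind of contrastive adversarial caption: it swaps a noun in a ground-truth caption for a WordNet co-hyponym, producing a fluent sentence that no longer matches the paired image. This is a minimal sketch, not the authors' pipeline (see the repository above for that); the use of NLTK's tokenizer, POS tagger, and WordNet interface is an assumption of this sketch.

```python
# Minimal sketch (not the authors' exact code) of synthesizing a textual
# contrastive adversarial sample: replace one noun in a caption with a
# WordNet co-hyponym so the sentence stays fluent but mismatches the image.
import random

import nltk
from nltk.corpus import wordnet as wn

# Resource names vary across NLTK versions; failed downloads are ignored.
for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger",
                 "averaged_perceptron_tagger_eng", "wordnet"):
    nltk.download(resource, quiet=True)


def co_hyponyms(word):
    """Collect WordNet co-hyponyms (siblings) of the first noun sense."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    siblings = set()
    for hypernym in synsets[0].hypernyms():
        for sibling in hypernym.hyponyms():
            for lemma in sibling.lemma_names():
                name = lemma.replace("_", " ")
                if name.lower() != word.lower():
                    siblings.add(name)
    return sorted(siblings)


def contrastive_caption(caption, rng=random):
    """Swap one noun for a WordNet sibling to build a hard negative."""
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)
    noun_positions = [i for i, (_, tag) in enumerate(tagged)
                      if tag.startswith("NN")]
    rng.shuffle(noun_positions)
    for i in noun_positions:
        candidates = co_hyponyms(tokens[i])
        if candidates:
            tokens[i] = rng.choice(candidates)
            return " ".join(tokens)
    return None  # no replaceable noun found


print(contrastive_caption("a man riding a horse on the beach"))
# e.g. "a man riding a zebra on the beach" -- fluent text, wrong image
```

In training, such synthesized captions would serve as hard textual negatives for their paired image in the contrastive (e.g., max-margin ranking) loss, alongside the usual in-batch negatives.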
