Image-Enhanced Multi-level Sentence Representation Net for Natural Language Inference

The Natural Language Inference (NLI) task requires an agent to determine the semantic relation between a premise sentence (p) and a hypothesis sentence (h), which demands sufficient understanding of the sentences, from lexical knowledge to global semantics. Due to issues such as polysemy, ambiguity, and the fuzziness of sentences, fully understanding sentences remains challenging. To this end, we propose the Image-Enhanced Multi-Level Sentence Representation Net (IEMLRN), a novel architecture that utilizes images to enhance sentence semantic understanding at different scales. Specifically, we introduce the image corresponding to the sentences as reference information, which can be helpful for sentence semantic understanding and inference relation evaluation. Since image information might be related to sentence semantics at different scales, we design a multi-level architecture that understands sentences at different granularities and generates sentence representations more precisely. Experimental results on a large-scale NLI corpus and a real-world NLI-like corpus demonstrate that IEMLRN improves performance on both. Notably, IEMLRN significantly outperforms state-of-the-art sentence-encoding based models on the challenging hard subset and lexical subset of the SNLI corpus.
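The abstract does not specify the model in detail, but the core idea of fusing an image feature with sentence representations built at several granularities can be sketched roughly as follows. This is a minimal illustration only: the gating scheme, the mean/max pooling choices, and all function names here are assumptions for exposition, not the authors' actual IEMLRN architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def gate_fuse(text_feat, img_feat):
    """Blend a text feature with the image feature, gated by their similarity.

    This sigmoid-gated blend is a placeholder for whatever fusion IEMLRN
    actually uses; it is only meant to show image-as-reference fusion.
    """
    sim = text_feat @ img_feat / (
        np.linalg.norm(text_feat) * np.linalg.norm(img_feat) + 1e-8
    )
    g = 1.0 / (1.0 + np.exp(-sim))          # scalar gate in (0, 1)
    return g * text_feat + (1.0 - g) * img_feat

def multi_level_representation(word_vecs, img_feat):
    """Build word-, phrase-, and sentence-level views, each fused with the image."""
    word_level = word_vecs.mean(axis=0)                      # token granularity
    phrase_level = np.stack(
        [word_vecs[i:i + 2].mean(axis=0)                     # bigram granularity
         for i in range(len(word_vecs) - 1)]
    ).mean(axis=0)
    sent_level = word_vecs.max(axis=0)                       # global granularity
    fused = [gate_fuse(v, img_feat) for v in (word_level, phrase_level, sent_level)]
    return np.concatenate(fused)

d = 8
words = rng.normal(size=(5, d))   # toy word embeddings (e.g., GloVe vectors)
image = rng.normal(size=d)        # toy image feature (e.g., from a CNN encoder)
rep = multi_level_representation(words, image)
print(rep.shape)                  # three fused d-dim views concatenated: (24,)
```

In the real model the three granularities would come from learned encoders and the image feature from a pretrained CNN (e.g., VGG), with the fused multi-level representation of each sentence then fed to a relation classifier.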
