What If We Simply Swap the Two Text Fragments? A Straightforward yet Effective Way to Test the Robustness of Methods to Confounding Signals in Natural Language Inference Tasks

Natural language inference (NLI) is the task of determining the inference relationship between a pair of natural-language sentences. With the increasing popularity of NLI, many state-of-the-art predictive models have been proposed with impressive performance. However, several works have noticed statistical irregularities in the collected NLI data sets that may lead to overestimated performance of these models, and have proposed remedies. In this paper, we further investigate these statistical irregularities, which we refer to as confounding factors, in NLI data sets. Based on the observation that some NLI labels should be preserved when the two text fragments are swapped (for example, contradiction is symmetric), we propose a simple yet effective way of evaluating NLI predictive models, swapping the two text fragments, that naturally mitigates the observed problems. Further, we continue to train the predictive models with varying percentages of swapped training text fragments, and propose to use the deviation of a model's evaluation performance across these percentages to describe the robustness of the model. Our evaluation metric leads to some interesting insights into recently published NLI methods. Finally, we also apply the swapping operation to NLI models to assess the effectiveness of this straightforward method in mitigating confounding-factor problems when training generic sentence embeddings for other NLP transfer tasks.
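To make the swap-based evaluation concrete, the following is a minimal sketch in Python. The helper names (`swap_pairs`, `swap_robustness`), the default swap fractions, and the rule for which labels survive a swap are illustrative assumptions based on the abstract, not the paper's exact implementation.

```python
import random
import statistics

# Assumption for this sketch: contradiction is symmetric under swapping,
# so its label survives a swap; other pairs are left untouched.
SYMMETRIC_LABELS = {"contradiction"}

def swap_pairs(examples, fraction, seed=0):
    """Swap premise/hypothesis for roughly `fraction` of the examples
    whose label is preserved under swapping.

    `examples` is a list of (premise, hypothesis, label) triples.
    """
    rng = random.Random(seed)
    out = []
    for premise, hypothesis, label in examples:
        if label in SYMMETRIC_LABELS and rng.random() < fraction:
            out.append((hypothesis, premise, label))
        else:
            out.append((premise, hypothesis, label))
    return out

def swap_robustness(train_and_score, train_set,
                    fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Deviation of evaluation performance across swap percentages.

    `train_and_score` is any callable that trains a model on the given
    (partially swapped) training set and returns an accuracy on a fixed
    test set. The standard deviation of these accuracies stands in for
    the robustness score the abstract describes.
    """
    scores = [train_and_score(swap_pairs(train_set, f)) for f in fractions]
    return statistics.pstdev(scores)
```

Under this reading, a model whose accuracy stays stable as the fraction of swapped training fragments varies is considered more robust to the confounding signal, while a large deviation suggests the model relies on fragment-order artifacts.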
