Counterfactual VQA: A Cause-Effect Look at Language Bias

VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language. Recent debiasing methods proposed to exclude the language prior during inference. However, they fail to disentangle the "good" language context and "bad" language bias from the whole. In this paper, we investigate how to mitigate language bias in VQA. Motivated by causal effects, we proposed a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers and reduce the language bias by subtracting the direct language effect from the total causal effect. Experiments demonstrate that our proposed counterfactual inference framework 1) is general to various VQA backbones and fusion strategies, 2) achieves competitive performance on the language-bias sensitive VQA-CP dataset while performs robustly on the balanced VQA v2 dataset without any augmented data. The code is available at https://github.com/yuleiniu/cfvqa.

[1]  Jianfei Cai,et al.  Causal Attention for Vision-Language Tasks , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Chunyan Miao,et al.  Distilling Causal Effect of Data in Class-Incremental Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Xian-Sheng Hua,et al.  Counterfactual Zero-Shot and Open-Set Visual Recognition , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Hanwang Zhang,et al.  Deconfounded Image Captioning: A Causal Retrospect , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Weitao Jiang,et al.  Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering , 2020, EMNLP.

[6]  Xian-Sheng Hua,et al.  Interventional Few-Shot Learning , 2020, NeurIPS.

[7]  Jinhui Tang,et al.  Causal Intervention for Weakly-Supervised Semantic Segmentation , 2020, NeurIPS.

[8]  William Yang Wang,et al.  Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning , 2020, EMNLP.

[9]  Chitta Baral,et al.  MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering , 2020, EMNLP.

[10]  Zhou Zhao,et al.  DeVLBert: Learning Deconfounded Visio-Linguistic Representations , 2020, ACM Multimedia.

[11]  Anurag Mittal,et al.  Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder , 2020, ECCV.

[12]  Yongdong Zhang,et al.  Overcoming Language Priors with Self-supervised Learning for Visual Question Answering , 2020, IJCAI.

[13]  Anton van den Hengel,et al.  Counterfactual Vision and Language Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Anton van den Hengel,et al.  On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law , 2020, NeurIPS.

[15]  Anton van den Hengel,et al.  Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision , 2020, ECCV.

[16]  Nuno Vasconcelos,et al.  SCOUT: Self-Aware Discriminant Counterfactual Explanations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Yunde Jia,et al.  Overcoming Language Priors in VQA via Decomposed Linguistic Representations , 2020, AAAI.

[18]  Shiliang Pu,et al.  Counterfactual Samples Synthesizing for Robust Visual Question Answering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Hanwang Zhang,et al.  Visual Commonsense R-CNN , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Anton van den Hengel,et al.  Unshuffling Data for Improved Generalization in Visual Question Answering , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Peter M. Aronow,et al.  The Book of Why: The New Science of Cause and Effect , 2020, Journal of the American Statistical Association.

[23]  Mario Fritz,et al.  Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Hanwang Zhang,et al.  Two Causal Principles for Improving Visual Dialog , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  J. Tenenbaum,et al.  CLEVRER: CoLlision Events for Video REpresentation and Reasoning , 2019, ICLR.

[26]  Luke Zettlemoyer,et al.  Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases , 2019, EMNLP.

[27]  Matthieu Cord,et al.  RUBi: Reducing Unimodal Biases in Visual Question Answering , 2019, NeurIPS.

[28]  Yonatan Belinkov,et al.  Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects , 2019, Proceedings of the Second Workshop on Shortcomings in Vision and Language.

[29]  Raymond J. Mooney,et al.  Self-Critical Reasoning for Robust Visual Question Answering , 2019, NeurIPS.

[30]  David Sontag,et al.  Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models , 2019, ICML.

[31]  Jianfei Cai,et al.  Learning to Collocate Neural Modules for Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Ziyan Wu,et al.  Counterfactual Visual Explanations , 2019, ICML.

[33]  Shu Kong,et al.  Modularized Textual Grounding for Counterfactual Resilience , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Matthieu Cord,et al.  MUREL: Multimodal Relational Reasoning for Visual Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Hongxia Jin,et al.  Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Matthieu Cord,et al.  BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection , 2019, AAAI.

[37]  Long Chen,et al.  Counterfactual Critic Multi-Agent Training for Scene Graph Generation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Tatsuya Harada,et al.  Multimodal Explanations by Predicting Counterfactuality in Videos , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Ali Farhadi,et al.  From Recognition to Cognition: Visual Commonsense Reasoning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Larry S. Davis,et al.  Explicit Bias Discovery in Visual Question Answering Models , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Stefan Lee,et al.  Overcoming Language Priors in Visual Question Answering with Adversarial Regularization , 2018, NeurIPS.

[42]  Yash Goyal,et al.  Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2016, International Journal of Computer Vision.

[43]  Mélanie Frappier,et al.  The Book of Why: The New Science of Cause and Effect , 2018, Science.

[44]  Trevor Darrell,et al.  Grounding Visual Explanations , 2018, ECCV.

[45]  Anima Anandkumar,et al.  Question Type Guided Attention in Visual Question Answering , 2018, ECCV.

[46]  Mihaela van der Schaar,et al.  GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets , 2018, ICLR.

[47]  Trevor Darrell,et al.  Multimodal Explanations: Justifying Decisions and Pointing to the Evidence , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Dhruv Batra,et al.  Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Qi Wu,et al.  Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Aidong Zhang,et al.  Representation Learning for Treatment Effect Estimation from Observational Data , 2018, NeurIPS.

[52]  Zhou Yu,et al.  Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53]  Matthieu Cord,et al.  MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  Christopher Kanan,et al.  An Analysis of Visual Question Answering Algorithms , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  José M. F. Moura,et al.  Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Uri Shalit,et al.  Estimating individual treatment effect: generalization bounds and algorithms , 2016, ICML.

[57]  Dhruv Batra,et al.  Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.

[58]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[59]  Dhruv Batra,et al.  Analyzing the Behavior of Visual Question Answering Models , 2016, EMNLP.

[60]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[61]  Yash Goyal,et al.  Yin and Yang: Balancing and Answering Binary Visual Questions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[64]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[65]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[66]  Joaquin Quiñonero Candela,et al.  Counterfactual reasoning and learning systems: the example of computational advertising , 2012, J. Mach. Learn. Res..

[67]  Martin Rohleder,et al.  Survivorship Bias and Mutual Fund Performance: Relevance, Significance, and Methodical Differences , 2010 .

[68]  V. Chernozhukov,et al.  Inference on Counterfactual Distributions , 2009, 0904.0951.

[69]  Michael H. Bowling,et al.  Regret Minimization in Games with Incomplete Information , 2007, NIPS.

[70]  Mark J van der Laan,et al.  Estimation of Direct Causal Effects , 2006, Epidemiology.

[71]  Judea Pearl,et al.  Direct and Indirect Effects , 2001, UAI.

[72]  J. Robins,et al.  Marginal Structural Models and Causal Inference in Epidemiology , 2000, Epidemiology.

[73]  J. Pearl Causality: Models, Reasoning and Inference , 2000 .

[74]  J. Robins A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect , 1986 .

[75]  Donald B. Rubin,et al.  Bayesian Inference for Causal Effects: The Role of Randomization , 1978 .