论文信息 - Cascaded Mutual Modulation for Visual Reasoning

Cascaded Mutual Modulation for Visual Reasoning

Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature, and also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both question and image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable textual/visual pipeline to mutually control each other. Experiments show that CMM significantly outperforms most related models, and reach state-of-the-arts on two visual reasoning benchmarks: CLEVR and NLVR, collected from both synthetic and natural languages. Ablation studies confirm that both our multistep framework and our visual-guided language modulation are critical to the task. Our code is available at this https URL.

[1] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[2] Dan Klein,et al. Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Li Fei-Fei,et al. Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4] Razvan Pascanu,et al. A simple neural network module for relational reasoning , 2017, NIPS.

[5] Jung-Woo Ha,et al. Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Trevor Darrell,et al. Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7] Christopher D. Manning,et al. Compositional Attention Networks for Machine Reasoning , 2018, ICLR.

[8] Yoav Artzi,et al. A Corpus of Natural Language for Visual Reasoning , 2017, ACL.

[9] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Hao Tan,et al. Object Ordering with Bidirectional Matchings for Visual Reasoning , 2018, NAACL-HLT.

[11] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[12] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Justin Johnson,et al. DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer , 2018, ArXiv.

[14] Xiao-Jing Wang,et al. A dataset and architecture for visual reasoning with a working memory , 2018, ECCV.

[15] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[16] Aaron C. Courville,et al. FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[17] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[18] David Mascharka,et al. Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19] Dan Klein,et al. Learning to Compose Neural Networks for Question Answering , 2016, NAACL.

[20] Li Fei-Fei,et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[22] Hugo Larochelle,et al. Modulating early visual processing by language , 2017, NIPS.