Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering

We propose a new class of probabilistic neural-symbolic models, that have symbolic functional programs as a latent, stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. Firstly, the programs generated by our model are more understandable while requiring lesser number of teaching examples. Secondly, we show that one can pose counterfactual scenarios to the model, to probe its beliefs on the programs that could lead to a specified answer given an image. Our results on the CLEVR and SHAPES datasets verify our hypotheses, showing that the model gets better program (and answer) prediction accuracy even in the low data regime, and allows one to probe the coherence and consistency of reasoning performed.

[1]  Quoc V. Le,et al.  Neural Programmer: Inducing Latent Programs with Gradient Descent , 2015, ICLR.

[2]  Christopher D. Manning,et al.  Compositional Attention Networks for Machine Reasoning , 2018, ICLR.

[3]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[4]  Phil Blunsom,et al.  Language as a Latent Variable: Discrete Generative Models for Sentence Compression , 2016, EMNLP.

[5]  Léon Bottou,et al.  From machine learning to machine reasoning , 2011, Machine Learning.

[6]  Swarat Chaudhuri,et al.  HOUDINI: Lifelong Learning as Program Synthesis , 2018, NeurIPS.

[7]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[8]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[9]  Trevor Darrell,et al.  Explainable Neural Computation via Stack Neural Module Networks , 2018, ECCV.

[10]  Alexander A. Alemi,et al.  Fixing a Broken ELBO , 2017, ICML.

[11]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[12]  Richard Evans,et al.  Can Neural Networks Understand Logical Entailment? , 2018, ICLR.

[13]  Masahiro Suzuki,et al.  Joint Multimodal Learning with Deep Generative Models , 2016, ICLR.

[14]  Herbert A. Simon,et al.  Computer Science as Empirical Inquiry , 2011 .

[15]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Marco Baroni,et al.  Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks , 2017, ICML.

[17]  Dhruv Batra,et al.  Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.

[18]  Li Fei-Fei,et al.  Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[20]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[21]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[22]  Joshua B. Tenenbaum,et al.  Building machines that learn and think like people , 2016, Behavioral and Brain Sciences.

[23]  Graham Neubig,et al.  StructVAE: Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing , 2018, ACL.

[24]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[25]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Percy Liang,et al.  From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood , 2017, ACL.

[28]  Stefan Lee,et al.  Neural Modular Control for Embodied Question Answering , 2018, CoRL.

[29]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[30]  Jason Weston,et al.  Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.

[31]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[32]  Chuang Gan,et al.  Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding , 2018, NeurIPS.

[33]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[35]  Allen Newell,et al.  Computer science as empirical inquiry: symbols and search , 1976, CACM.

[36]  Stefan Lee,et al.  EvalAI: Towards Better Evaluation Systems for AI Agents , 2019, ArXiv.

[37]  Leslie G. Valiant,et al.  Three problems in computer science , 2003, JACM.

[38]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[39]  Chuang Gan,et al.  The Neuro-Symbolic Concept Learner: Interpreting Scenes Words and Sentences from Natural Supervision , 2019, ICLR.

[40]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.

[41]  Quoc V. Le,et al.  Neural Architecture Search with Reinforcement Learning , 2016, ICLR.

[42]  Kevin Murphy,et al.  Generative Models of Visually Grounded Imagination , 2017, ICLR.

[43]  J. Tenenbaum A Bayesian framework for concept learning , 1999 .

[44]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[45]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[46]  Ankur Taly,et al.  Did the Model Understand the Question? , 2018, ACL.

[47]  Trevor Darrell,et al.  Multimodal Explanations: Justifying Decisions and Pointing to the Evidence , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  David Mascharka,et al.  Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Siddhartha S. Srinivasa,et al.  Legibility and predictability of robot motion , 2013, 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[50]  Sumit Gulwani,et al.  Neural-Guided Deductive Search for Real-Time Program Synthesis from Examples , 2018, ICLR.

[51]  Jiasen Lu,et al.  VQA: Visual Question Answering , 2015, ICCV.

[52]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[54]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[55]  Dan Klein,et al.  Modular Multitask Reinforcement Learning with Policy Sketches , 2016, ICML.

[56]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Marc Brockschmidt,et al.  Lifelong Perceptual Programming By Example , 2016, ICLR.