Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding

Grounding natural language in images essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. They may perform well on short phrases but generally fail on longer sentences, mainly because they over-fit to certain vision-language biases. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion, as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTree), which regularizes visual grounding along the dependency parse tree of the sentence, where each node is a module network that calculates or accumulates the grounding score in a bottom-up direction as needed. NMTree disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive, easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which account for the discrete process of module selection. Overall, the proposed NMTree not only consistently outperforms the state of the art on several benchmarks and tasks, but also exhibits explainable reasoning in its grounding-score calculation. NMTree thus marks a promising step toward closing the gap between explainability and performance.
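
As a minimal illustration of the bottom-up score accumulation the abstract describes (not the authors' implementation), the sketch below recursively scores image regions against each word of a dependency parse tree and accumulates child scores upward. The class and module names (`GroundingTree`, `score_single`, `score_pair`, `TreeNode`) are hypothetical placeholders.

```python
# A hedged sketch of bottom-up grounding-score accumulation along a
# dependency parse tree; module designs here are illustrative, not NMTree's.
import torch
import torch.nn as nn

class TreeNode:
    def __init__(self, word_feat, children=None):
        self.word_feat = word_feat      # embedding of this node's word
        self.children = children or []  # dependent sub-trees

class GroundingTree(nn.Module):
    """Accumulates per-region grounding scores bottom-up along the parse tree."""
    def __init__(self, word_dim, region_dim):
        super().__init__()
        # Single-node module: scores regions against one word (e.g. "dog").
        self.score_single = nn.Bilinear(word_dim, region_dim, 1)
        # Composition module: re-scores regions given a child sub-tree's
        # accumulated score (e.g. "on" combining "dog" with "the sofa").
        self.score_pair = nn.Linear(word_dim + region_dim + 1, 1)

    def forward(self, node, regions):
        # regions: (num_regions, region_dim) visual features
        n = regions.size(0)
        word = node.word_feat.expand(n, -1)
        score = self.score_single(word, regions).squeeze(-1)    # (num_regions,)
        for child in node.children:
            child_score = self.forward(child, regions)          # recurse bottom-up
            feats = torch.cat([word, regions, child_score.unsqueeze(-1)], dim=-1)
            score = score + self.score_pair(feats).squeeze(-1)  # accumulate upward
        return score  # higher score = more likely referent for this sub-tree

# Hypothetical usage: a two-node tree ("dog" -> "sofa") over 5 candidate regions.
tree = TreeNode(torch.randn(8), children=[TreeNode(torch.randn(8))])
net = GroundingTree(word_dim=8, region_dim=16)
scores = net(tree, torch.randn(5, 16))  # one grounding score per region
```

Because the score is computed per node and summed up the tree, each intermediate node's score vector can be inspected directly, which is what makes the reasoning explainable.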
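
The end-to-end training of discrete module selection can likewise be sketched with PyTorch's built-in `F.gumbel_softmax` straight-through estimator. The toy linear "modules" and the `selector_logits` parameter below are illustrative assumptions standing in for NMTree's actual grounding modules.

```python
# A hedged sketch of straight-through Gumbel-Softmax module selection:
# the forward pass makes a hard one-hot (discrete) choice, while the
# backward pass flows gradients through the soft relaxed sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
modules = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])  # candidate modules
selector_logits = nn.Parameter(torch.zeros(3))                # learned selection scores

x = torch.randn(2, 4)                                         # dummy node input
# hard=True returns a one-hot vector whose gradient is that of the soft sample.
choice = F.gumbel_softmax(selector_logits, tau=1.0, hard=True)  # e.g. [0., 1., 0.]
# The weighted sum reduces to the single chosen module's output (one-hot
# weights), yet remains differentiable w.r.t. selector_logits.
out = sum(w * m(x) for w, m in zip(choice, modules))
out.sum().backward()            # gradients reach the discrete selector
print(selector_logits.grad)
```

Annealing the temperature `tau` toward zero over training sharpens the relaxation toward truly discrete choices, which is how end-to-end training can compensate for dependency-parsing errors in module assembly.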
