Procedural Reasoning Networks for Understanding Multimodal Procedures

This paper addresses the problem of comprehending procedural commonsense knowledge. This is a challenging task, as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. In contrast to most previous work, we do not rely on strong inductive biases; instead, we explore how multimodality can be exploited to provide a complementary semantic signal. To this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we use no supervision at the level of entity states.
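To make the relational-memory idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of an attention-based memory update: each memory slot holds one entity's state, and at every reading step each slot attends over all slots plus the newly read entity representation, so that entity states are updated in relation to each other. All function and parameter names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relational_memory_update(memory, entity_vec, W_q, W_k, W_v):
    """One relational update step (illustrative sketch).

    memory:     (num_slots, d) -- one row per tracked entity state
    entity_vec: (d,)           -- representation of the entity just read
    W_q/W_k/W_v: (d, d)        -- query/key/value projections
    """
    # Slots attend over all slots plus the new observation, so each
    # entity state is revised in the context of every other entity.
    candidates = np.vstack([memory, entity_vec])          # (num_slots+1, d)
    q = memory @ W_q                                      # (num_slots, d)
    k = candidates @ W_k                                  # (num_slots+1, d)
    v = candidates @ W_v                                  # (num_slots+1, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (num_slots, num_slots+1)
    update = attn @ v                                     # (num_slots, d)
    # Residual connection keeps each slot's previous state partially intact.
    return memory + np.tanh(update)
```

In a full model, this update would run once per instruction step, and the projection matrices would be learned end to end; a gating mechanism could further control how much of each slot is overwritten.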
