A dataset and architecture for visual reasoning with a working memory

A vexing problem in artificial intelligence is reasoning about events that occur in complex, changing visual stimuli such as in video analysis or game play. Inspired by a rich tradition of visual reasoning and memory in cognitive psychology and neuroscience, we developed an artificial, configurable visual question and answer dataset (COG) to parallel experiments in humans and animals. COG is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory -- problems that remain challenging for modern deep learning architectures. We additionally propose a deep learning architecture that performs competitively on other diagnostic VQA datasets (i.e. CLEVR) as well as easy settings of the COG dataset. However, several settings of COG result in datasets that are progressively more challenging to learn. After training, the network can zero-shot generalize to many new tasks. Preliminary analyses of the network architectures trained on COG demonstrate that the network accomplishes the task in a manner interpretable to humans.

[1]  P. Zelazo,et al.  An age-related dissociation between knowing rules and using them ☆ , 1996 .

[2]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[4]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[5]  H. Francis Song,et al.  Clustering and compositionality of task representations in a neural network trained to perform many cognitive tasks , 2017, bioRxiv.

[6]  Jonathon Shlens,et al.  A Learned Representation For Artistic Style , 2016, ICLR.

[7]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[8]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[9]  Alexander Kuhnle,et al.  ShapeWorld - A new test methodology for multimodal language understanding , 2017, ArXiv.

[10]  Xiao-Jing Wang,et al.  The importance of mixed selectivity in complex cognitive tasks , 2013, Nature.

[11]  Mario Fritz,et al.  A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.

[12]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13]  Alex Graves,et al.  Neural Turing Machines , 2014, ArXiv.

[14]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  D B YNTEMA,et al.  Keeping Track of Several Things at Once1 , 1963, Human factors.

[17]  W. Newsome,et al.  Context-dependent computation by recurrent dynamics in prefrontal cortex , 2013, Nature.

[18]  Dhruv Batra,et al.  Analyzing the Behavior of Visual Question Answering Models , 2016, EMNLP.

[19]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[20]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[21]  Alex Graves,et al.  Adaptive Computation Time for Recurrent Neural Networks , 2016, ArXiv.

[22]  K. Miller Executive functions. , 2005, Pediatric annals.

[23]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[25]  M. J. Emerson,et al.  The Unity and Diversity of Executive Functions and Their Contributions to Complex “Frontal Lobe” Tasks: A Latent Variable Analysis , 2000, Cognitive Psychology.

[26]  K. H. Britten,et al.  Neuronal correlates of a perceptual decision , 1989, Nature.

[27]  R. Desimone,et al.  Neural Mechanisms of Visual Working Memory in Prefrontal Cortex of the Macaque , 1996, The Journal of Neuroscience.

[28]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[30]  Wei Xu,et al.  Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question , 2015, NIPS.

[31]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[32]  Jason Weston,et al.  Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.

[33]  Kathryn M. McMillan,et al.  N‐back working memory paradigm: A meta‐analysis of normative functional neuroimaging studies , 2005, Human brain mapping.

[34]  Terry Winograd,et al.  Understanding natural language , 1974 .

[35]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[36]  R. Andersen,et al.  Multimodal representation of space in the posterior parietal cortex and its use in planning movements. , 1997, Annual review of neuroscience.

[37]  E. A. Berg,et al.  A simple objective technique for measuring flexibility in thinking. , 1948, The Journal of general psychology.

[38]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[39]  Emilio Salinas,et al.  Cognitive neuroscience: Flutter Discrimination: neural codes, perception, memory and decision making , 2003, Nature Reviews Neuroscience.

[40]  Jascha Sohl-Dickstein,et al.  Capacity and Trainability in Recurrent Neural Networks , 2016, ICLR.

[41]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[42]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Bob L. Sturm A Simple Method to Determine if a Music Information Retrieval System is a “Horse” , 2014, IEEE Transactions on Multimedia.

[44]  Sergio Gomez Colmenarejo,et al.  Hybrid computing using a neural network with dynamic external memory , 2016, Nature.

[45]  D. Hassabis,et al.  Neuroscience-Inspired Artificial Intelligence , 2017, Neuron.

[46]  Michael W. Cole,et al.  Rapid instructed task learning: A new window into the human brain’s unique capacity for flexible cognitive control , 2013, Cognitive, affective & behavioral neuroscience.

[47]  E. Miller,et al.  An integrative theory of prefrontal cortex function. , 2001, Annual review of neuroscience.

[48]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[49]  B. Milner Effects of Different Brain Lesions on Card Sorting: The Role of the Frontal Lobes , 1963 .

[50]  Li Fei-Fei,et al.  Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[51]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[52]  M. D’Esposito Working memory. , 2008, Handbook of clinical neurology.

[53]  Tomas Mikolov,et al.  Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets , 2015, NIPS.

[54]  Edward K. Vogel,et al.  The capacity of visual working memory for features and conjunctions , 1997, Nature.

[55]  C. Lawrence Zitnick,et al.  Bringing Semantics into Focus Using Visual Abstraction , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Tom Schaul,et al.  StarCraft II: A New Challenge for Reinforcement Learning , 2017, ArXiv.

[57]  Christopher D. Manning,et al.  Compositional Attention Networks for Machine Reasoning , 2018, ICLR.

[58]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..