Linguistic Analysis of Multi-Modal Recurrent Neural Networks

Recurrent neural networks (RNNs) have gained a reputation for surpassing state-of-the-art results on many NLP benchmarks and for learning representations of words and larger linguistic units that encode complex syntactic and semantic structures. However, it is not straightforward to understand how exactly these models make their decisions. Recently, Li et al. (2015) developed methods to provide linguistically motivated analyses of RNNs trained for sentiment analysis. Here we focus on the analysis of a multi-modal Gated Recurrent Unit (GRU) architecture trained to predict image vectors (features extracted from images by a CNN trained on ImageNet) from their corresponding descriptions. We propose two methods to explore the importance of grammatical categories with respect to the model and the task. We observe that the model pays most attention to head words, nominal subjects, and adjectival modifiers, and least to determiners and coordinations.
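To make the setup concrete, the following is a minimal PyTorch sketch of such an architecture; the class name, dimensions, and mean-squared-error objective are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiModalGRU(nn.Module):
    """Hypothetical sketch: a GRU reads a description and its final hidden
    state is projected to the dimensionality of a CNN image vector."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024, image_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, image_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices of the description
        embedded = self.embed(token_ids)
        _, final_hidden = self.gru(embedded)          # (1, batch, hidden_dim)
        return self.project(final_hidden.squeeze(0))  # predicted image vector

# Training objective (assumed): make the predicted vector match the CNN
# feature vector of the paired image, e.g. with a mean-squared-error loss.
model = MultiModalGRU(vocab_size=10000)
tokens = torch.randint(0, 10000, (8, 12))  # toy batch of tokenized descriptions
image_vecs = torch.randn(8, 4096)          # toy CNN features for the paired images
loss = nn.functional.mse_loss(model(tokens), image_vecs)
```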