Towards learning domain-general representations for language from multi-modal data
Recurrent neural networks (RNNs) have gained a reputation for producing state-of-the-art results on many NLP tasks and for producing representations of words, phrases and larger linguistic units that encode complex syntactic and semantic structures. Recently, these models have also been used extensively with multi-modal data, e.g. for caption generation and automatic video description. The contribution of our present work is two-fold: a) we propose a novel multi-modal recurrent neural network architecture for domain-general representation learning, and b) we introduce a number of methods that "open up the black box" and shed light on what kind of linguistic knowledge the network learns. We propose IMAGINET, an RNN architecture that learns visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with a shared word embedding matrix. It uses a multi-task objective: given a textual description of a scene, it concurrently predicts the scene's visual representation (extracted from images using a CNN trained on ImageNet) and the next word in the sentence. Moreover, we perform two exploratory analyses: a) we show that the model learns to effectively use sequential structure in semantic interpretation, and b) we propose two methods to explore the importance of grammatical categories with respect to the model and the task. We observe that the model pays most attention to head-words, noun subjects and adjectival modifiers, and least to determiners and prepositions.
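To make the described architecture concrete, the following is a minimal sketch of an IMAGINET-style model in PyTorch: two GRUs sharing a word embedding matrix, one predicting the image feature vector from its final hidden state and one predicting the next word at each position, combined in a weighted multi-task loss. All names and hyperparameter values (embedding size, hidden size, image feature dimension, loss weight alpha) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ImaginetSketch(nn.Module):
    """Illustrative sketch: two GRUs with a shared word embedding matrix."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024, image_dim=4096):
        super().__init__()
        # Shared word embeddings feed both pathways (assumed dimensions).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Visual pathway: the final hidden state predicts the CNN image features.
        self.visual_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_image = nn.Linear(hidden_dim, image_dim)
        # Textual pathway: per-step hidden states predict the next word.
        self.textual_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        emb = self.embed(tokens)                       # (batch, seq, embed_dim)
        _, h_vis = self.visual_gru(emb)                # final hidden state
        image_pred = self.to_image(h_vis.squeeze(0))   # (batch, image_dim)
        out_txt, _ = self.textual_gru(emb)             # (batch, seq, hidden_dim)
        next_word_logits = self.to_vocab(out_txt)      # (batch, seq, vocab)
        return image_pred, next_word_logits


def multitask_loss(image_pred, image_feats, logits, next_tokens, alpha=0.5):
    """Weighted sum of the visual (regression) and textual (LM) objectives."""
    visual_loss = nn.functional.mse_loss(image_pred, image_feats)
    lm_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), next_tokens.reshape(-1))
    return alpha * visual_loss + (1 - alpha) * lm_loss
```

The key design point mirrored here is the shared embedding matrix: because both the visual-prediction and next-word-prediction tasks backpropagate into the same word vectors, the resulting representations are shaped jointly by visual grounding and sequential language modelling.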