Associative Multichannel Autoencoder for Multimodal Word Representation

In this paper we address the problem of learning multimodal word representations by integrating textual, visual and auditory inputs. Inspired by the reconstructive and associative nature of human memory, we propose a novel associative multichannel autoencoder (AMA). Our model first learns associations between the textual and perceptual modalities, so as to predict the missing perceptual information of concepts. The textual and predicted perceptual representations are then fused by reconstructing both their original and their associated embeddings. Through a gating mechanism, our model assigns different weights to each modality depending on the concept. Results on six benchmark concept-similarity tests show that the proposed method significantly outperforms strong unimodal baselines and state-of-the-art multimodal models.

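To make the described pipeline concrete, the sketch below gives one plausible reading of the architecture in PyTorch: association networks predict missing visual and auditory vectors from the text embedding, per-concept gates weight each perceptual channel, and a shared autoencoder fuses the channels by reconstructing both the original and the associated inputs. This is a minimal illustration, not the authors' implementation; all module names, layer sizes and the single-layer choices are assumptions.

```python
# Minimal sketch of an associative multichannel autoencoder in the spirit of
# the abstract above (not the authors' code). Dimensions and module choices
# are illustrative assumptions.
import torch
import torch.nn as nn


class AssociativeMultichannelAE(nn.Module):
    def __init__(self, d_text=300, d_img=128, d_sound=128, d_hidden=256):
        super().__init__()
        # Association networks: predict (possibly missing) perceptual
        # vectors from the textual embedding.
        self.text_to_img = nn.Sequential(nn.Linear(d_text, d_img), nn.Tanh())
        self.text_to_sound = nn.Sequential(nn.Linear(d_text, d_sound), nn.Tanh())
        # Gates: per-concept weights for each perceptual channel.
        self.gate_img = nn.Sequential(nn.Linear(d_text + d_img, d_img), nn.Sigmoid())
        self.gate_sound = nn.Sequential(nn.Linear(d_text + d_sound, d_sound), nn.Sigmoid())
        # Fusion autoencoder: encode the concatenated channels into the
        # multimodal word representation, decode back to every channel.
        d_in = d_text + d_img + d_sound
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Tanh())
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, text, img=None, sound=None):
        # Predict perceptual vectors from text; use them where the real
        # perceptual input is missing for a concept.
        img_hat = self.text_to_img(text)
        sound_hat = self.text_to_sound(text)
        img = img if img is not None else img_hat
        sound = sound if sound is not None else sound_hat
        # Gate each perceptual channel conditioned on the concept.
        img = self.gate_img(torch.cat([text, img], dim=-1)) * img
        sound = self.gate_sound(torch.cat([text, sound], dim=-1)) * sound
        x = torch.cat([text, img, sound], dim=-1)
        h = self.encoder(x)      # multimodal word representation
        recon = self.decoder(h)  # reconstruction of original + associated inputs
        return h, recon, x, (img_hat, sound_hat)


def loss_fn(recon, target, img_hat, sound_hat, img_true=None, sound_true=None):
    """Reconstruction loss plus association losses where ground truth exists."""
    loss = nn.functional.mse_loss(recon, target)
    if img_true is not None:
        loss = loss + nn.functional.mse_loss(img_hat, img_true)
    if sound_true is not None:
        loss = loss + nn.functional.mse_loss(sound_hat, sound_true)
    return loss
```

In such a setup, training would sum the reconstruction loss over all words with the association losses over the subset of concepts that have ground-truth perceptual vectors; at evaluation time the hidden code `h` would serve as the multimodal word representation used in the similarity tests.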