Symbol Grounding Association in Multimodal Sequences with Missing Elements

In this paper, we extend a symbolic association framework to handle missing elements in multimodal sequences. The general scope of the work is the symbolic association of object-word mappings, as it occurs during language development in infants. In other words, two different representations of the same abstract concept can be associated in both directions. This scenario has long been of interest in Artificial Intelligence, Psychology, and Neuroscience. In this work, we extend a recent approach for multimodal sequences (visual and audio) to also cope with missing elements in one or both modalities. Our method uses two parallel Long Short-Term Memories (LSTMs) with a learning rule based on the Expectation-Maximization (EM) algorithm, and aligns the outputs of both LSTMs via Dynamic Time Warping (DTW). We propose an extra combination step using the max operation to exploit the common elements between both sequences. The motivation is that this combination acts as a condition selector, choosing the best representation from either LSTM. We evaluated the proposed extension in the following scenarios: missing elements in one modality (visual or audio) and missing elements in both modalities. Our extension achieves better results than the original model and results comparable to individual LSTMs trained on each modality.
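The alignment-and-combination step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean local distance, and the toy per-class probability vectors are all assumptions; the paper's actual LSTM outputs, EM-based training, and DTW variant may differ.

```python
import numpy as np

def dtw_align(a, b):
    """Dynamic Time Warping over two sequences of per-class probability
    vectors (e.g. frame-wise outputs of the visual and audio LSTMs).
    Returns the optimal alignment path as (i, j) index pairs.
    Euclidean distance is an assumed local cost, for illustration."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # insertion
                                 cost[i, j - 1])      # deletion
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def max_combine(a, b, path):
    """Element-wise max over aligned frames: acts as a condition
    selector, keeping the stronger class score from either modality,
    so a missing element in one stream can be covered by the other."""
    return np.array([np.maximum(a[i], b[j]) for i, j in path])

# Toy example: two 2-frame, 2-class probability sequences.
visual = np.array([[0.9, 0.1], [0.2, 0.8]])
audio  = np.array([[0.6, 0.4], [0.1, 0.9]])
path = dtw_align(visual, audio)
combined = max_combine(visual, audio, path)
labels = [int(np.argmax(frame)) for frame in combined]
```

In this toy case DTW matches the sequences frame by frame, and the max combination retains the more confident prediction per aligned frame, which is the "condition selector" behaviour the abstract refers to.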
