Symbol Grounding in Multimodal Sequences using Recurrent Neural Networks

The problem of how infants learn to associate visual inputs, speech, and internal symbolic representations has long been of interest in psychology, neuroscience, and artificial intelligence. A priori, both visual and auditory inputs are complex analog signals with a large amount of noise and context, and lacking any segmentation information. In this paper, we address a simple form of this problem: the association of one visual input and one auditory input with each other. We show that the presented model learns segmentation, recognition, and symbolic representation under two simple assumptions: (1) that a symbolic representation exists, and (2) that the two different inputs represent the same symbolic structure. Our approach uses two Long Short-Term Memory (LSTM) networks for multimodal sequence learning and recovers the internal symbolic space using an EM-style algorithm. We compared our model against LSTM on three multimodal datasets: digit, letter, and word recognition. Our model achieves performance comparable to LSTM.
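The core idea of the EM-style alternation can be illustrated without the LSTM encoders. The toy sketch below is an assumption-laden simplification, not the paper's implementation: the two modalities are replaced by fixed noisy feature channels, and the shared symbol space is recovered by hard EM, where the E-step assigns each paired observation to the symbol whose prototypes are jointly closest across both modalities (enforcing assumption (2), that both inputs represent the same symbol), and the M-step re-estimates the per-modality prototypes. All variable names and the farthest-point initialization are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two modalities observe the same hidden symbol
# sequence (3 symbols) through different noisy 2-D feature channels.
n_symbols, n_samples = 3, 300
symbols = rng.integers(0, n_symbols, n_samples)      # hidden symbol labels
centers_a = rng.normal(size=(n_symbols, 2)) * 4      # modality A prototypes
centers_b = rng.normal(size=(n_symbols, 2)) * 4      # modality B prototypes
x_a = centers_a[symbols] + rng.normal(scale=0.3, size=(n_samples, 2))
x_b = centers_b[symbols] + rng.normal(scale=0.3, size=(n_samples, 2))

# Farthest-point initialization on the joint features, so the initial
# prototypes land in distinct clusters.
feats = np.hstack([x_a, x_b])
idx = [0]
while len(idx) < n_symbols:
    d0 = np.min([np.linalg.norm(feats - feats[i], axis=1) for i in idx], axis=0)
    idx.append(int(d0.argmax()))
proto_a, proto_b = x_a[idx].copy(), x_b[idx].copy()

for _ in range(20):
    # E-step: joint squared distance across BOTH modalities ties the two
    # inputs to a single shared symbol assignment.
    d = (np.linalg.norm(x_a[:, None] - proto_a[None], axis=2) ** 2
         + np.linalg.norm(x_b[:, None] - proto_b[None], axis=2) ** 2)
    assign = d.argmin(axis=1)
    # M-step: re-estimate each modality's prototype per symbol.
    for k in range(n_symbols):
        mask = assign == k
        if mask.any():
            proto_a[k] = x_a[mask].mean(axis=0)
            proto_b[k] = x_b[mask].mean(axis=0)

# Purity: recovered symbols should match the hidden ones up to relabeling.
purity = sum(np.bincount(symbols[assign == k]).max()
             for k in range(n_symbols) if (assign == k).any()) / n_samples
print(f"cluster purity: {purity:.2f}")
```

In the full model, the fixed feature channels above are replaced by the two LSTMs, whose per-timestep outputs supply the modality-specific representations that the EM-style step aligns into one symbolic space.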
