Emergence of multimodal action representations from neural network self-organization

The integration of multisensory information plays a crucial role in autonomous robotics for forming robust and meaningful representations of the environment. In this work, we investigate how robust multimodal representations can naturally develop in a self-organizing manner from co-occurring multisensory inputs. We propose a hierarchical architecture with growing self-organizing neural networks for learning human actions from audiovisual inputs. The hierarchical processing of visual inputs yields progressively specialized neurons that encode latent spatiotemporal dynamics of the input, consistent with neurophysiological evidence for increasingly large temporal receptive windows in the human cortex. Associative links that bind unimodal representations are learned incrementally by a semi-supervised algorithm with bidirectional connectivity. Multimodal representations of actions are obtained from the co-activation of action features extracted from video sequences and labels produced by automatic speech recognition. Experimental results on a dataset of 10 full-body actions show that our system achieves state-of-the-art classification performance without requiring manual segmentation of training samples, and that congruent visual representations can be retrieved from recognized speech in the absence of visual stimuli. Together, these results show that our hierarchical neural architecture accounts for the development of robust multimodal representations from dynamic audiovisual inputs.
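
The idea of bidirectional associative links learned from co-activation can be illustrated with a small sketch. It is not the algorithm of this work: it assumes a fixed set of visual prototype vectors instead of growing self-organizing networks, plain co-occurrence counts instead of the learning rule used here, and hypothetical names throughout (`AssociativeLinks`, `best_matching_unit`, `observe`, `label_for`, `neurons_for`). Samples without a recognized word leave the links unchanged, which loosely mirrors the semi-supervised setting.

```python
import math
from collections import defaultdict


def best_matching_unit(prototypes, features):
    """Return the index of the prototype closest to the feature vector."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(prototypes[i], features))


class AssociativeLinks:
    """Hypothetical bidirectional links between visual neurons and word labels."""

    def __init__(self):
        # neuron index -> label -> co-activation count
        self.to_label = defaultdict(lambda: defaultdict(int))
        # label -> neuron index -> co-activation count
        self.to_neuron = defaultdict(lambda: defaultdict(int))

    def observe(self, neuron, label=None):
        """Strengthen the link in both directions when a label co-occurs."""
        if label is None:  # unlabeled sample: no link update
            return
        self.to_label[neuron][label] += 1
        self.to_neuron[label][neuron] += 1

    def label_for(self, neuron):
        """Vision -> language: most frequently co-activated label, if any."""
        links = self.to_label.get(neuron)
        return max(links, key=links.get) if links else None

    def neurons_for(self, label, top_k=1):
        """Language -> vision: neurons most strongly linked to a label."""
        links = self.to_neuron.get(label, {})
        return sorted(links, key=links.get, reverse=True)[:top_k]


# Toy data (assumed for illustration): two prototypes, a short input stream.
prototypes = [(0.0, 0.0), (1.0, 1.0)]
links = AssociativeLinks()
stream = [((0.1, 0.0), "walk"), ((0.9, 1.1), "jump"),
          ((0.0, 0.2), None), ((0.1, 0.1), "walk")]
for features, word in stream:
    links.observe(best_matching_unit(prototypes, features), word)
```

In this toy setting, `links.label_for(0)` classifies activity at the first prototype, while `links.neurons_for("jump")` retrieves a visual prototype from a word alone, analogous to retrieving visual representations from recognized speech when no visual stimulus is present.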
