Expectation Learning for Stimulus Prediction Across Modalities Improves Unisensory Classification

Expectation learning is an unsupervised learning process that uses multisensory bindings to enhance unisensory perception. For instance, as humans we learn to associate a barking sound with the visual appearance of a dog, and we continuously fine-tune this association over time, e.g., learning to associate high-pitched barking with small dogs. In this work, we address the problem of developing a computational model that captures important properties of expectation learning, focusing in particular on the lack of explicit external supervision other than temporal co-occurrence. To this end, we present a novel hybrid neural model based on audio-visual autoencoders and a recurrent self-organizing network for multisensory bindings that facilitate stimulus reconstructions across different sensory modalities. We refer to this mechanism as stimulus prediction across modalities and demonstrate that the proposed model is capable of learning concept bindings by evaluating it on unisensory classification tasks for audio-visual stimuli, using the 43,500 YouTube videos in the animal subset of the AudioSet corpus.
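
To make the described setup more concrete, the following is a minimal sketch of the training signal: two unisensory autoencoders produce latent codes, and a simple dense cross-modal binding layer (standing in for the recurrent self-organizing network) learns to predict one modality's code from the other using only temporal co-occurrence as supervision. All layer sizes, module names, and input dimensions are hypothetical illustrations, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class UnisensoryAutoencoder(nn.Module):
    """Encodes a flattened unisensory stimulus into a latent code and decodes it back."""
    def __init__(self, input_dim: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

class CrossModalBinding(nn.Module):
    """Maps each modality's latent code to the co-occurring modality's code
    (a dense stand-in for the self-organizing binding network)."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.audio_to_visual = nn.Linear(latent_dim, latent_dim)
        self.visual_to_audio = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_audio, z_visual):
        return self.audio_to_visual(z_audio), self.visual_to_audio(z_visual)

# One training step: the only supervision is that the audio and visual samples
# co-occur in the same video clip (input dimensions are hypothetical).
audio_ae, visual_ae = UnisensoryAutoencoder(128), UnisensoryAutoencoder(4096)
binding = CrossModalBinding()
params = list(audio_ae.parameters()) + list(visual_ae.parameters()) + list(binding.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

audio, frame = torch.randn(8, 128), torch.randn(8, 4096)   # co-occurring pair
audio_rec, z_a = audio_ae(audio)
frame_rec, z_v = visual_ae(frame)
z_v_pred, z_a_pred = binding(z_a, z_v)

optimizer.zero_grad()
loss = (nn.functional.mse_loss(audio_rec, audio)            # unisensory reconstruction
        + nn.functional.mse_loss(frame_rec, frame)
        + nn.functional.mse_loss(z_v_pred, z_v.detach())    # expectation: predict the
        + nn.functional.mse_loss(z_a_pred, z_a.detach()))   # co-occurring modality's code
loss.backward()
optimizer.step()
```

After training, the cross-modal mappings allow a stimulus presented in one modality to be reconstructed (predicted) in the other, which is the mechanism evaluated through the unisensory classification tasks mentioned above.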
