STDP Based Unsupervised Multimodal Learning With Cross-Modal Processing in Spiking Neural Networks

Spiking neural networks perform reasonably well in recognition applications for a single modality (e.g., images, audio, or text). In this paper, we propose a multimodal spiking neural network that combines two modalities (image and audio). The two unimodal ensembles are connected with cross-modal connections, and the entire network is trained with unsupervised learning. The network receives inputs in both modalities for the same class and predicts the class label. The excitatory connections in the unimodal ensembles and the cross-modal connections are trained with a power-law weight-dependent spike-timing-dependent plasticity (STDP) learning rule. The cross-modal connections capture the correlation between neurons of different modalities. The multimodal network learns features of both modalities and improves classification accuracy compared to a unimodal topology, even when one of the modalities is distorted by noise. The cross-modal connections suppress the effect of noise on classification accuracy. Well-learned cross-modal connections evoke additional spiking activity in the neurons of the correct label. The cross-modal connections are purely excitatory and do not inhibit the normal activity of the unimodal ensembles. We evaluated our multimodal network on images from the MNIST dataset and spoken-digit utterances from the TI46 speech corpus. The multimodal network achieved a classification accuracy of 98% on the combined MNIST and TI46 dataset.
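The learning rule and the two-ensemble prediction described above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation. It assumes the power-law weight-dependent form used in Diehl and Cook's unsupervised STDP network, Δw = η(x_pre − x_tar)(w_max − w)^μ, applied to the incoming synapses of a postsynaptic neuron when it spikes. The hyperparameter values, the label-assignment scoring, and the summed fusion in the hypothetical `predict_multimodal` function are illustrative assumptions.

```python
import numpy as np

# Illustrative hyperparameters (assumed, not taken from the paper).
ETA = 0.01     # learning rate
X_TAR = 0.4    # target value of the presynaptic trace
W_MAX = 1.0    # upper bound on synaptic weights
MU = 0.9       # exponent of the power-law weight dependence


def stdp_update(w, x_pre, post_spiked):
    """Apply dw = ETA * (x_pre - X_TAR) * (W_MAX - w)**MU to synapses onto
    postsynaptic neurons that just spiked (assumes 0 <= w <= W_MAX).

    w           : (n_pre, n_post) excitatory weights (unimodal or cross-modal)
    x_pre       : (n_pre,) presynaptic traces at the moment of the update
    post_spiked : (n_post,) boolean mask of postsynaptic spikes
    """
    dw = ETA * (x_pre[:, None] - X_TAR) * (W_MAX - w) ** MU
    # Only columns of neurons that spiked are updated.
    w = w + dw * post_spiked[None, :]
    # Keep weights excitatory (non-negative) and bounded.
    return np.clip(w, 0.0, W_MAX)


def class_scores(spike_counts, assignments, n_classes):
    """Average spike count of the neurons assigned to each class
    (0 for a class with no assigned neurons)."""
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        mask = assignments == c
        if mask.any():
            scores[c] = spike_counts[mask].mean()
    return scores


def predict_multimodal(img_counts, img_assign, aud_counts, aud_assign, n_classes):
    """Hypothetical fusion: sum the per-class scores of the image and audio
    ensembles and return the class with the highest total."""
    total = (class_scores(img_counts, img_assign, n_classes)
             + class_scores(aud_counts, aud_assign, n_classes))
    return int(np.argmax(total))
```

In this toy fusion, a noisy image ensemble that splits its spikes evenly between classes 0 and 1 (counts `[1, 1, 0, 0]` with assignments `[0, 1, 2, 2]`) is disambiguated by an audio ensemble that responds strongly for class 1 (counts `[0, 4, 3]` with assignments `[0, 1, 1]`), giving a prediction of class 1.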
