Emergent spatio-temporal multimodal learning using a developmental network

Conventional machine learning requires humans to train each module manually with handcrafted data and symbols, and the resulting models are confined to particular tasks. To address this limitation, we design a multimodal autonomous learning architecture, based on a developmental network, for the co-development of audition and vision. The developmental network is a biologically inspired mechanism that enables an agent to develop and integrate audition and vision simultaneously. Furthermore, synapse maintenance is introduced into visual information learning to improve the video recognition rate, and a neuron regenesis mechanism is implemented to improve the efficiency of network usage. In the experiments, a number of fundamental words are acquired and identified using the proposed learning methodology, without any prior knowledge of the objects or the verbal questions. The experiments show that the proposed learning method achieves significantly higher recognition rates than the state-of-the-art method.
