Sound classification and localization in service robots with attention mechanisms

Human-machine interaction calls for a sophisticated understanding of subjects' behavior, a capability required of smartphones, home automation and entertainment devices, and many service robots. While interacting with human beings in their environment, a service robot must be able to perceive and process both the visual and the sound information of the scene that it observes. To capture salient elements in such heterogeneous signals, many semi-supervised deep learning methods have been proposed. In this article, we propose a new convolutional neural network endowed with an attention mechanism that not only classifies a sound event but also localizes it temporally, in a semi-supervised way.
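The idea of attention-based temporal localization from clip-level labels can be illustrated with a minimal sketch. This is not the paper's architecture; it is an illustrative assumption in which a CNN backbone (not shown) produces per-frame class scores and per-frame attention scores, and the attention weights both pool the frames into a clip-level prediction and indicate when the event occurs. All function names, shapes, and values below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frame_logits, attn_logits):
    """Attention-weighted temporal pooling (illustrative sketch).

    frame_logits: (T, C) per-frame class scores from a CNN backbone.
    attn_logits:  (T,)   per-frame attention scores.
    Returns clip-level class probabilities (C,) and the attention
    weights (T,), whose peaks localize the event in time.
    """
    weights = softmax(attn_logits)                # (T,) sums to 1
    frame_probs = softmax(frame_logits, axis=-1)  # (T, C)
    clip_probs = weights @ frame_probs            # convex combination -> (C,)
    return clip_probs, weights

# Toy example: 5 frames, 3 classes; attention is strongest at frame 2,
# so that frame dominates the clip-level prediction.
rng = np.random.default_rng(0)
frame_logits = rng.normal(size=(5, 3))
attn_logits = np.array([-2.0, 0.0, 4.0, 0.0, -2.0])
clip_probs, weights = attention_pool(frame_logits, attn_logits)
```

Because only the clip-level prediction is supervised, the attention weights are learned without frame-level labels, which is what makes the temporal localization weakly (semi-) supervised.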
