Sound classification and localization in service robots with attention mechanisms

Human-machine interaction calls for a sophisticated understanding of subjects' behavior, a capability required of smartphones, home automation and entertainment devices, and many service robots. While interacting with human beings in their environment, a service robot must be able to perceive and process both the visual and the sound information of the scene that it observes. To capture salient elements in such heterogeneous signals, many semi-supervised deep learning methods have been proposed. In this article, we propose a new convolutional neural network endowed with an attention mechanism that not only classifies a sound event but also localizes it temporally, in a semi-supervised way.
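The idea of attention-based temporal localization from clip-level labels can be illustrated with a minimal sketch. This is not the paper's architecture; it is an illustrative assumption in which a CNN backbone (not shown) produces per-frame class scores and per-frame attention scores, and the attention weights both pool the frames into a clip-level prediction and indicate when the event occurs. All function names, shapes, and values below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frame_logits, attn_logits):
    """Attention-weighted temporal pooling (illustrative sketch).

    frame_logits: (T, C) per-frame class scores from a CNN backbone.
    attn_logits:  (T,)   per-frame attention scores.
    Returns clip-level class probabilities (C,) and the attention
    weights (T,), whose peaks localize the event in time.
    """
    weights = softmax(attn_logits)                # (T,) sums to 1
    frame_probs = softmax(frame_logits, axis=-1)  # (T, C)
    clip_probs = weights @ frame_probs            # convex combination -> (C,)
    return clip_probs, weights

# Toy example: 5 frames, 3 classes; attention is strongest at frame 2,
# so that frame dominates the clip-level prediction.
rng = np.random.default_rng(0)
frame_logits = rng.normal(size=(5, 3))
attn_logits = np.array([-2.0, 0.0, 4.0, 0.0, -2.0])
clip_probs, weights = attention_pool(frame_logits, attn_logits)
```

Because only the clip-level prediction is supervised, the attention weights are learned without frame-level labels, which is what makes the temporal localization weakly (semi-) supervised.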
