Environment sound classification using an attention-based residual neural network

Abstract

The complexity of environmental sounds imposes numerous challenges for their classification. The performance of Environmental Sound Classification (ESC) depends greatly on how well the feature extraction technique employed can extract generic and prototypical features from a sound. Silent and semantically irrelevant frames are ubiquitous in environmental sound recordings. To deal with these issues, we introduce a novel attention-based deep model that focuses on the semantically relevant frames. The proposed attention-guided deep model efficiently learns the spatio-temporal relationships that exist in the spectrogram of a signal. The efficacy of the proposed method is evaluated on two widely used Environmental Sound Classification datasets: ESC-10 and DCASE 2019 Task-1(A). The experimental results demonstrate that the proposed method yields performance comparable to state-of-the-art techniques, with accuracy improvements of 11.50% and 19.50% over the baseline models on the ESC-10 and DCASE 2019 Task-1(A) datasets, respectively. To verify that the attention mechanism focuses on relevant regions, a visual analysis of the attention feature map is also presented. The resultant attention feature map shows that the model attends only to the spectrogram's semantically relevant regions while skipping the irrelevant ones.
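The abstract does not give the exact architecture, but the core idea it describes — a residual network whose feature maps are gated by a learned attention map over the spectrogram, so silent or irrelevant time-frequency regions are suppressed — can be sketched as follows. This is a minimal, hypothetical PyTorch illustration; the layer sizes, the 1x1-convolution attention head, and the input dimensions are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class AttentiveResidualBlock(nn.Module):
    """Residual conv block whose features are gated by a learned spatial
    attention map, down-weighting silent/irrelevant spectrogram regions.
    Illustrative only; not the paper's exact architecture."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        # 1x1 conv producing one attention weight in (0, 1) per time-frequency bin
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor):
        h = self.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        a = self.attn(h)                # (N, 1, F, T) attention map
        h = h * a                       # suppress irrelevant regions
        return self.relu(x + h), a      # residual sum; map kept for visualization


# Example: batch of 4 feature maps from 128-mel spectrograms with 431 frames
block = AttentiveResidualBlock(16)
y, attn_map = block(torch.randn(4, 16, 128, 431))
print(y.shape, attn_map.shape)
```

Returning the attention map alongside the features is what enables the kind of visual analysis the abstract mentions: the map can be overlaid on the input spectrogram to check which regions the model attends to.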
