Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

The complexity of polyphonic sounds poses numerous challenges for their classification. In real-life recordings especially, polyphonic sound events are discontinuous and show unstable time-frequency variations. A single traditional acoustic feature cannot capture the key information of a polyphonic sound event, and this deficiency leads to poor classification performance. In this paper, we propose a convolutional recurrent neural network model based on a temporal-frequency (TF) attention mechanism and a feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model takes aggregated Log-Mel spectrogram and MFCC features as input and consists of a TF-attention module, a convolutional recurrent neural network (CRNN) module, an FS-attention module, and a bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures the critical temporal-frequency features more effectively. The FS-attention module assigns dynamically learnable weights to the different feature dimensions. With these two attention modules, the model can focus on semantically relevant time frames, key frequency bands, and important feature spaces, which improves its representation of key feature information in polyphonic SED. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task 3 and DCASE 2017 Task 3 datasets. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2% compared with the winning systems of the DCASE challenge, and the error rate (ER) was also reduced by 0.41 and 0.37. The proposed TFFS-CRNN model thus achieves better classification performance and a lower ER in polyphonic SED.
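
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an F1-score and an ER can be computed for polyphonic SED. It assumes the segment-based definitions commonly used in DCASE evaluations; the abstract does not state which variant the authors used. The function name `segment_metrics` and the toy labels are hypothetical and serve illustration only.

```python
from typing import List, Set, Tuple


def segment_metrics(
    reference: List[Set[str]], predicted: List[Set[str]]
) -> Tuple[float, float]:
    """Return (F1, ER) over aligned segments of active event labels.

    Assumption: segment-based scoring, where each segment holds the set of
    event classes active in it (several at once for polyphonic audio).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    tp = fp = fn = 0
    substitutions = deletions = insertions = n_ref = 0

    for ref, pred in zip(reference, predicted):
        seg_tp = len(ref & pred)   # events detected correctly
        seg_fp = len(pred - ref)   # events reported but not present
        seg_fn = len(ref - pred)   # events present but missed

        tp += seg_tp
        fp += seg_fp
        fn += seg_fn

        # A miss paired with a false alarm in the same segment is a substitution.
        substitutions += min(seg_fn, seg_fp)
        deletions += max(0, seg_fn - seg_fp)
        insertions += max(0, seg_fp - seg_fn)
        n_ref += len(ref)

    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    er = (substitutions + deletions + insertions) / n_ref if n_ref else 0.0
    return f1, er


# Toy example: four segments with overlapping events (hypothetical labels).
reference = [{"car", "speech"}, {"car"}, {"speech"}, set()]
predicted = [{"car"}, {"car", "dog"}, {"dog"}, set()]

f1, er = segment_metrics(reference, predicted)
print(f"F1 = {f1:.2f}, ER = {er:.2f}")  # F1 = 0.50, ER = 0.75
```

Under these definitions a higher F1-score and a lower ER both indicate better detection. The ER is not bounded by 1, because insertions can outnumber the reference events.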
