Learning discriminative and robust time-frequency representations for environmental sound classification

Convolutional neural networks (CNNs) are among the best-performing neural network architectures for environmental sound classification (ESC). Recently, attention mechanisms have been used in CNNs to capture useful information from the audio signal for sound classification, especially for weakly labelled data, where the training data provide sound class labels but no timing information about the acoustic events. These methods, however, do not explicitly exploit the inherent time-frequency characteristics and variations when obtaining the deep features. In this paper, we propose a new method, called the time-frequency enhancement block (TFBlock), in which temporal attention and frequency attention are employed to enhance the features from relevant frames and frequency bands. Unlike other attention mechanisms, our method constructs parallel branches that allow the temporal and frequency features to be attended to separately, in order to mitigate interference from sections of the acoustic environment where no sound events occur. Experiments on three benchmark ESC datasets show that our method improves classification performance and is also robust to noise.
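The idea of two parallel attention branches can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the layers inside TFBlock, so the mean pooling, the scalar scale `w` and bias `b`, the sigmoid gating, and the summation of the two branches are all assumptions made for illustration. The function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping scores to weights in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def temporal_attention(x, w=1.0, b=0.0):
    """Re-weight each time frame of x, shaped (channels, time, freq).

    Assumption: frame scores come from mean pooling over channels and
    frequency, followed by a scalar affine map and a sigmoid.
    """
    scores = x.mean(axis=(0, 2))            # one score per frame: (time,)
    weights = sigmoid(w * scores + b)       # attention weights: (time,)
    return x * weights[None, :, None]       # broadcast over channels and freq


def frequency_attention(x, w=1.0, b=0.0):
    """Re-weight each frequency band of x, shaped (channels, time, freq).

    Assumption: band scores come from mean pooling over channels and time.
    """
    scores = x.mean(axis=(0, 1))            # one score per band: (freq,)
    weights = sigmoid(w * scores + b)       # attention weights: (freq,)
    return x * weights[None, None, :]       # broadcast over channels and time


def tf_block(x):
    """Run both branches in parallel on the same input and sum the results.

    Assumption: summation is one plausible way to merge the branches.
    """
    return temporal_attention(x) + frequency_attention(x)


# Toy feature map: 4 channels, 10 frames, 8 frequency bands.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 10, 8))
enhanced = tf_block(features)
```

Because each branch multiplies the input element-wise by weights between 0 and 1, the output keeps the input shape, and frames or bands with low pooled scores are attenuated rather than removed.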
