Acoustic Scene Classification with Spectrogram Processing Strategies

Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance on the acoustic scene classification (ASC) task. The audio data are typically transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study how to efficiently exploit different spectrogram representations through discriminative processing strategies. We make two main contributions. First, we explore the impact of combining multiple spectrogram representations at different stages of the network, providing a meaningful reference for effective spectrogram fusion. Second, we propose processing strategies over multiple frequency bands and multiple temporal frames to make full use of a single spectrogram representation. The proposed spectrogram processing strategies can be easily transferred to any network structure. Experiments are carried out on the DCASE 2020 Task 1 datasets, and the results show that our method achieves accuracies of 81.8% (official baseline: 54.1%) and 92.1% (official baseline: 87.3%) on the officially provided fold 1 evaluation datasets of Task 1A and Task 1B, respectively.
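To make the multi-band processing idea concrete, the following is a minimal sketch of sub-spectrogram processing with score-level fusion: a log-mel spectrogram is split along the frequency axis into sub-bands, each sub-band is scored by its own classifier, and the per-band class posteriors are fused. The use of librosa, the equal band split, the averaging fusion rule, and the `band_model` classifiers are illustrative assumptions, not the exact configuration described in the abstract.

```python
# Sketch: multi-frequency-band (sub-spectrogram) processing with
# score-level fusion. Band split, classifiers, and fusion rule are
# illustrative assumptions.
import numpy as np
import librosa


def log_mel(y, sr, n_mels=128, n_fft=2048, hop_length=512):
    """Compute a log-scaled mel spectrogram of shape (n_mels, frames)."""
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)


def split_bands(spec, n_bands=4):
    """Split the frequency axis into (roughly) equal sub-spectrograms."""
    return np.array_split(spec, n_bands, axis=0)


def fuse_scores(band_scores):
    """Score-level fusion: average the per-band class posteriors."""
    return np.mean(np.stack(band_scores, axis=0), axis=0)


# Usage with hypothetical per-band classifiers `band_models`,
# each exposing a predict(sub_spectrogram) -> class-posterior vector:
# y, sr = librosa.load("scene.wav", sr=44100)
# bands = split_bands(log_mel(y, sr))
# scores = [m.predict(b) for b, m in zip(bands, band_models)]
# label = int(np.argmax(fuse_scores(scores)))
```

The same pattern extends naturally to the temporal axis (splitting along frames instead of mel bins) and to fusing scores from multiple spectrogram types, which is the multi-representation fusion the abstract refers to.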
