Urban Sound Classification: Striving Towards a Fair Comparison

Urban sound classification has achieved remarkable progress and remains an active research area in audio pattern recognition. In particular, it enables the monitoring of noise pollution, a growing concern for large cities. The contribution of this paper is twofold. First, we present our DCASE 2020 Task 5 winning solution, which aims to help monitor urban noise pollution. It achieves a macro-AUPRC of 0.82 / 0.62 for the coarse / fine classification on the validation set. Moreover, it reaches accuracies of 89.7% and 85.41% on the ESC-50 and US8K datasets, respectively. Second, fair comparisons between existing models are hard to find, and their reported performance is often difficult to reproduce; authors sometimes copy-paste results from the original papers, which does not help reproducibility. We therefore provide a fair comparison by using the same input representation, metrics, and optimizer to assess performance, while preserving the data augmentation used in the original papers. We hope this framework will help evaluate new architectures in this field. For better reproducibility, the code is available in our GitHub repository.
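The macro-AUPRC metric mentioned above averages the area under the precision-recall curve across classes, giving each class equal weight regardless of its frequency. A minimal sketch of how such a score can be computed with scikit-learn is shown below; the class count and the random labels/scores are hypothetical, purely for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical multi-label setup: 100 clips, 4 coarse classes.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 4))   # binary ground-truth labels per class
y_score = rng.random(size=(100, 4))          # model output probabilities per class

# Macro-AUPRC: average precision (area under the PR curve) per class,
# then an unweighted mean over classes.
per_class = [average_precision_score(y_true[:, k], y_score[:, k])
             for k in range(y_true.shape[1])]
macro_auprc = float(np.mean(per_class))
print(f"macro-AUPRC: {macro_auprc:.3f}")
```

This is equivalent to `average_precision_score(y_true, y_score, average="macro")`; the per-class loop is shown only to make the averaging explicit.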
