SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification

In this paper, we present SpecAugment++, a novel data augmentation method for deep neural network based acoustic scene classification (ASC). Unlike popular data augmentation methods such as SpecAugment and mixup, which operate only on the input space, SpecAugment++ is applied to both the input space and the hidden space of the deep neural network, enhancing both the input and the intermediate feature representations. For an intermediate hidden state, the augmentation consists of masking blocks of frequency channels and blocks of time frames, which improves generalization by forcing the model to attend not only to the most discriminative parts of the feature but to the entire feature. Apart from filling masks with zeros, we also examine two masking approaches that draw values from other samples within the minibatch, introducing noise that makes the network more discriminative for classification. Experimental results on the DCASE 2018 Task 1 and DCASE 2019 Task 1 datasets show that our proposed method obtains 3.6% and 4.7% accuracy gains, respectively, over a strong baseline without augmentation (CP-ResNet), and outperforms previous data augmentation methods.
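The masking scheme described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, block widths, and the partner-selection rule (rolling the batch by one) are illustrative assumptions; in practice the operation would be applied to intermediate CNN feature maps during training only.

```python
import numpy as np

def hidden_space_mask(h, freq_width=8, time_width=20, mode="zero", rng=None):
    """Mask one random block of frequency channels and one of time frames
    in a batch of hidden states h with shape (batch, freq, time).

    mode="zero" fills masked blocks with zeros; mode="swap" fills them
    with the corresponding block from another sample in the minibatch,
    sketching the noise-based masking variants described in the abstract.
    """
    rng = rng or np.random.default_rng()
    out = h.copy()
    batch, n_freq, n_time = h.shape
    # Pair each sample with a different one (simple roll; illustrative choice).
    partner = np.roll(np.arange(batch), 1)
    for b in range(batch):
        f0 = rng.integers(0, max(1, n_freq - freq_width + 1))
        t0 = rng.integers(0, max(1, n_time - time_width + 1))
        fill_f = 0.0 if mode == "zero" else h[partner[b], f0:f0 + freq_width, :]
        fill_t = 0.0 if mode == "zero" else h[partner[b], :, t0:t0 + time_width]
        out[b, f0:f0 + freq_width, :] = fill_f   # frequency masking
        out[b, :, t0:t0 + time_width] = fill_t   # time masking
    return out
```

Applied at the input layer this reduces to SpecAugment-style masking of the spectrogram; applying the same operation to hidden states is what distinguishes the hidden-space variant.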

[1] Yuexian Zou et al., "A Global-Local Attention Framework for Weakly Labelled Audio Tagging," ICASSP, 2021.

[2] Gerhard Widmer et al., "Low-Complexity Models for Acoustic Scene Classification Based on Receptive Field Regularization and Frequency Damping," arXiv, 2020.

[3] Quanquan Gu et al., "An Improved Analysis of Training Over-parameterized Deep Neural Networks," NeurIPS, 2019.

[4] S. Squartini et al., "DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks," DCASE, 2016.

[5] Justin Salamon et al., "Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification," IEEE Signal Processing Letters, 2016.

[6] Quoc V. Le et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," INTERSPEECH, 2019.

[7] Heikki Huttunen et al., "Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings," ICASSP, 2016.

[8] Yonghong Yan et al., "Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling," arXiv, 2019.

[9] Hongyi Zhang et al., "mixup: Beyond Empirical Risk Minimization," ICLR, 2017.

[10] Mark D. Plumbley et al., "Deep Neural Network Baseline for DCASE Challenge 2016," DCASE, 2016.

[11] Emmanuel Vincent et al., "Sound Event Detection in the DCASE 2017 Challenge," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.

[12] Ramit Sawhney et al., "SpeechMix - Augmenting Deep Sound Recognition Using Hidden Space Interpolations," INTERSPEECH, 2020.

[13] Tatsuya Harada et al., "Learning Environmental Sounds with End-to-End Convolutional Neural Network," ICASSP, 2017.

[14] Karol J. Piczak, "Environmental Sound Classification with Convolutional Neural Networks," MLSP, 2015.

[15] Gerhard Widmer et al., "The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification," EUSIPCO, 2019.

[16] Yuexian Zou et al., "Modeling Label Dependencies for Audio Tagging With Graph Convolutional Network," IEEE Signal Processing Letters, 2020.

[17] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization," ICLR, 2014.

[18] Annamaria Mesaros et al., "Acoustic Scene Classification in DCASE 2019 Challenge: Closed and Open Set Classification and Data Mismatch Setups," DCASE, 2019.

[19] Antonio Torralba et al., "SoundNet: Learning Sound Representations from Unlabeled Video," NIPS, 2016.

[20] Jian Sun et al., "Deep Residual Learning for Image Recognition," CVPR, 2016.

[21] Heikki Huttunen et al., "Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.

[22] Tatsuya Harada et al., "Learning from Between-class Examples for Deep Sound Recognition," ICLR, 2017.

[23] Tuomas Virtanen et al., "A Multi-Device Dataset for Urban Acoustic Scene Classification," DCASE, 2018.

[24] Yuexian Zou et al., "Acoustic Scene Classification with Spectrogram Processing Strategies," DCASE, 2020.

[25] Yan Song et al., "Robust Sound Event Recognition Using Convolutional Neural Networks," ICASSP, 2015.