EnvGAN: Adversarial Synthesis of Environmental Sounds for Data Augmentation

Research in Environmental Sound Classification (ESC) has grown steadily with the emergence of deep learning algorithms. However, data scarcity remains a major hurdle to significant advances in this domain. Data augmentation offers a practical solution to this problem. While Generative Adversarial Networks (GANs) have been successful in generating synthetic speech and sounds of musical instruments, they have hardly been applied to the generation of environmental sounds. This paper presents EnvGAN, the first application of GANs to the adversarial generation of environmental sounds. Our experiments on three standard ESC datasets illustrate that EnvGAN can synthesize audio similar to the recordings in those datasets. The suggested augmentation method outperforms most state-of-the-art techniques for audio augmentation.
