Unsupervised feature learning for environmental sound classification using Weighted Cycle-Consistent Generative Adversarial Network

In this paper, we propose a novel environmental sound classification approach that combines unsupervised feature learning from a codebook, built with the spherical $K$-Means++ algorithm, with a new architecture for high-level data augmentation. The audio signal is first transformed into a 2D representation using a discrete wavelet transform (DWT). The resulting DWT spectrograms are then augmented by a novel cycle-consistent generative adversarial network architecture. This high-level augmentation bootstraps generated spectrograms in both an intra-class and an inter-class manner by translating structural features from sample to sample. A codebook is built by coding the DWT spectrograms with the speeded-up robust features (SURF) detector and the $K$-Means++ algorithm. A Random Forest is the final learning algorithm, which learns the environmental sound classification task from the clustered codewords in the codebook. Experimental results on four benchmark environmental sound datasets (ESC-10, ESC-50, UrbanSound8k, and DCASE-2017) show that the proposed approach outperforms state-of-the-art classifiers in the field, including advanced and dense convolutional neural networks such as AlexNet and GoogLeNet, improving the classification rate by between 3.51% and 14.34%, depending on the dataset.
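The codebook step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`spherical_kmeans_pp`, `encode`), the iteration count, and the random 64-dimensional vectors standing in for SURF descriptors are all assumptions made for illustration. Seeding samples each new center with probability proportional to one minus the cosine similarity to the nearest chosen center, which for unit vectors is proportional to the squared Euclidean distance used by standard K-Means++ seeding.

```python
import numpy as np


def _normalize(x):
    """Scale each row to unit L2 norm (all-zero rows stay zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def spherical_kmeans_pp(descriptors, k, n_iter=20, seed=0):
    """Cluster unit-normalized descriptors by cosine similarity.

    Hypothetical helper: returns a (k, dim) array of unit-norm codewords.
    """
    rng = np.random.default_rng(seed)
    x = _normalize(np.asarray(descriptors, dtype=float))
    n = x.shape[0]

    # K-Means++-style seeding with dissimilarity 1 - cos (assumption, see text).
    centers = [x[rng.integers(n)]]
    for _ in range(1, k):
        d = np.clip(1.0 - np.max(x @ np.array(centers).T, axis=1), 0.0, None)
        total = d.sum()
        probs = d / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(x[rng.choice(n, p=probs)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assign each descriptor to the codeword with the highest cosine similarity.
        labels = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # an empty cluster keeps its previous codeword
                centers[j] = members.sum(axis=0)
        # Project the updated codewords back onto the unit sphere.
        centers = _normalize(centers)
    return centers


def encode(descriptors, codebook):
    """Normalized histogram of nearest-codeword counts for one spectrogram."""
    x = _normalize(np.asarray(descriptors, dtype=float))
    words = np.argmax(x @ codebook.T, axis=1)
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()


if __name__ == "__main__":
    # Synthetic stand-in for local descriptors pooled over training spectrograms.
    rng = np.random.default_rng(1)
    pooled = rng.normal(size=(500, 64))
    codebook = spherical_kmeans_pp(pooled, k=16)
    # One fixed-length feature vector per spectrogram, ready for a classifier.
    feature = encode(rng.normal(size=(40, 64)), codebook)
    print(feature.shape)
```

In a pipeline of this kind, each spectrogram would yield one such histogram, and the histograms would then serve as the input features of the final classifier, here a Random Forest.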
