Paper Information - Environment Sound Classification using Multiple Feature Channels and Deep Convolutional Neural Networks

In this paper, we propose a model for the Environment Sound Classification (ESC) task that feeds multiple feature channels into a Deep Convolutional Neural Network (CNN). The novelty of the paper lies in the combination of feature channels: Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), the Constant Q-transform (CQT) and the Chromagram. Such a combination of features has not been used before for signal or audio processing. We also employ a deeper CNN (DCNN) than previous models, built from 2D separable convolutions that operate on the time domain and the feature domain separately. The max pooling layers likewise downsample the time domain and the feature domain separately. We use data augmentation techniques to further boost performance. Our model achieves state-of-the-art performance on all three benchmark environment sound classification datasets: UrbanSound8K (98.60%), ESC-10 (97.25%) and ESC-50 (95.50%). To the best of our knowledge, this is the first time that a single environment sound classification model has achieved state-of-the-art results on all three datasets, and by a considerable margin over previous models. On the ESC-10 and ESC-50 datasets, the accuracy of the proposed model exceeds human accuracy, which is 95.7% and 81.3% respectively.
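The two structural ideas in the abstract, stacking several feature maps as input channels and then convolving and pooling along the time and feature axes separately, can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the 128 x 64 map size, the fixed smoothing kernel and the random arrays standing in for MFCC, GFCC, CQT and Chromagram maps are all assumptions made for illustration. A trained network would use learned filters that also mix channels.

```python
import numpy as np


def stack_channels(features):
    """Stack equally shaped (n_bins, n_frames) maps into (n_bins, n_frames, n_channels)."""
    shapes = {f.shape for f in features}
    if len(shapes) != 1:
        raise ValueError("all feature maps must share one (n_bins, n_frames) shape")
    return np.stack(features, axis=-1)


def conv_along_axis(x, kernel, axis):
    """'Valid' 1D cross-correlation of x with kernel along a single (non-negative) axis."""
    k = len(kernel)
    n_out = x.shape[axis] - k + 1
    out = np.zeros(x.shape[:axis] + (n_out,) + x.shape[axis + 1:])
    for i, w in enumerate(kernel):
        # Shifted slice of length n_out, weighted by the i-th kernel tap.
        out += w * np.take(x, np.arange(i, i + n_out), axis=axis)
    return out


def max_pool_along_axis(x, size, axis):
    """Non-overlapping max pooling along a single axis; a trailing remainder is dropped."""
    n_out = x.shape[axis] // size
    x = np.take(x, np.arange(n_out * size), axis=axis)
    grouped = x.reshape(x.shape[:axis] + (n_out, size) + x.shape[axis + 1:])
    return grouped.max(axis=axis + 1)


# Hypothetical stand-ins for four feature maps already brought to a common shape.
rng = np.random.default_rng(0)
mfcc, gfcc, cqt, chroma = (rng.standard_normal((128, 64)) for _ in range(4))

x = stack_channels([mfcc, gfcc, cqt, chroma])  # (128, 64, 4)

kernel = [0.25, 0.5, 0.25]                     # illustrative fixed filter, not learned
h = conv_along_axis(x, kernel, axis=1)         # time axis only     -> (128, 62, 4)
h = conv_along_axis(h, kernel, axis=0)         # feature axis only  -> (126, 62, 4)
h = max_pool_along_axis(h, 2, axis=1)          # downsample time    -> (126, 31, 4)
h = max_pool_along_axis(h, 2, axis=0)          # downsample feature -> (63, 31, 4)
```

Applying a 1 x 3 filter followed by a 3 x 1 filter covers the same 3 x 3 neighbourhood as a full 2D filter but with 6 weights per channel instead of 9. Keeping the pooling steps separate makes it possible to shrink the time axis and the feature axis at different rates.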
