Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation

In recent years, neural network approaches have shown superior performance to conventional hand-made features in numerous application areas. In particular, convolutional neural networks (ConvNets) exploit spatially local correlations across input data to improve the performance of audio processing tasks, such as speech recognition, musical chord recognition, and onset detection. Here we apply ConvNet to acoustic scene classification, and show that the error rate can be further decreased by using delta features in the frequency domain. We propose a multiple-width frequency-delta (MWFD) data augmentation method that uses static mel-spectrogram and frequency-delta features as individual input examples. In addition, we describe a ConvNet output aggregation method designed for MWFD augmentation, folded mean aggregation, which combines output probabilities of static and MWFD features from the same analysis window using multiplication first, rather than taking an average of all output probabilities. We describe calculation results using the DCASE 2016 challenge dataset, which shows that ConvNet outperforms both of the baseline system with hand-crafted features and a deep neural network approach by around 7%. The performance was further improved (by 5.7%) using the MWFD augmentation together with folded mean aggregation. The system exhibited a classification accuracy of 0.831 when classifying 15 acoustic scenes.

[1]  Juhan Nam,et al.  Sparse feature learning for instrument identification: Effects of sampling and pooling methods. , 2016, The Journal of the Acoustical Society of America.

[2]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[3]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[4]  Anders Krogh,et al.  Neural Network Ensembles, Cross Validation, and Active Learning , 1994, NIPS.

[5]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[6]  P. Herrera,et al.  RECURRENCE QUANTIFICATION ANALYSIS FEATURES FOR AUDITORY SCENE CLASSIFICATION , 2013 .

[7]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[8]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[9]  Tianqi Chen,et al.  Empirical Evaluation of Rectified Activations in Convolutional Network , 2015, ArXiv.

[10]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[11]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[12]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[13]  Juhan Nam,et al.  Learning Sparse Feature Representations for Music Annotation and Retrieval , 2012, ISMIR.

[14]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[15]  Mark D. Plumbley,et al.  Acoustic Scene Classification: Classifying environments from the sounds they produce , 2014, IEEE Signal Processing Magazine.

[16]  Douglas Eck,et al.  Temporal Pooling and Multiscale Learning for Automatic Annotation and Ranking of Music Audio , 2011, ISMIR.

[17]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[18]  Thomas Grill,et al.  Music Boundary Detection Using Neural Networks on Combined Features and Two-Level Annotations , 2015, ISMIR.

[19]  Ronald G. Dreslinski,et al.  Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers , 2015, ASPLOS.

[20]  A. Platzer Visualization of SNPs with t-SNE , 2013, PloS one.

[21]  Y. Nesterov Gradient methods for minimizing composite objective function , 2007 .

[22]  Thomas Grill,et al.  Boundary Detection in Music Structure Analysis using Convolutional Neural Networks , 2014, ISMIR.

[23]  Björn Schuller,et al.  RECOGNISING ACOUSTIC SCENES WITH LARGE-SCALE AUDIO FEATURE EXTRACTION AND SVM , 2013 .

[24]  Ariel Habshush,et al.  IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events IEEE AASP SCENE CLASSIFICATION CHALLENGE USING HIDDEN MARKOV MODELS AND FRAME BASED CLASSIFICATION , 2013 .

[25]  Beth Logan,et al.  Mel Frequency Cepstral Coefficients for Music Modeling , 2000, ISMIR.

[26]  Kyogu Lee,et al.  Detecting fingering of overblown flute sound using sparse feature learning , 2016, EURASIP J. Audio Speech Music. Process..

[27]  Dan Stowell,et al.  Detection and classification of acoustic scenes and events: An IEEE AASP challenge , 2013, 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[28]  Tuomas Virtanen,et al.  TUT database for acoustic scene classification and sound event detection , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[29]  Tara N. Sainath,et al.  Deep Convolutional Neural Networks for Large-scale Speech Tasks , 2015, Neural Networks.

[30]  Alain Rakotomamonjy,et al.  Histogram of gradients of Time-Frequency Representations for Audio scene detection , 2015, ArXiv.

[31]  Jae-Hun Kim,et al.  Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[32]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[33]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[34]  Tuomas Virtanen,et al.  Context-dependent sound event detection , 2013, EURASIP Journal on Audio, Speech, and Music Processing.

[35]  Sebastian Böck,et al.  Improved musical onset detection with Convolutional Neural Networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36]  Dong Yu,et al.  Exploring convolutional neural network structures and optimization techniques for speech recognition , 2013, INTERSPEECH.

[37]  Peter Li,et al.  Automatic Instrument Recognition in Polyphonic Music Using Convolutional Neural Networks , 2015, ArXiv.

[38]  Gerald Friedland,et al.  AN I-VECTOR BASED APPROACH FOR AUDIO SCENE DETECTION , 2013 .

[39]  Eric O. Postma,et al.  Texton-based analysis of paintings , 2010, Optical Engineering + Applications.

[40]  Juan Pablo Bello,et al.  Rethinking Automatic Chord Recognition with Convolutional Neural Networks , 2012, 2012 11th International Conference on Machine Learning and Applications.

[41]  Dan Stowell,et al.  A database and challenge for acoustic scene classification and event detection , 2013, 21st European Signal Processing Conference (EUSIPCO 2013).

[42]  Gerald Penn,et al.  Convolutional Neural Networks for Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[43]  Mounya Elhilali,et al.  MULTIRESOLUTION AUDITORY REPRESENTATIONS FOR SCENE CLASS IFICATION , 2013 .

[44]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).