Acoustic scene classification using convolutional neural networks

Acoustic scene classification (ASC) aims to distinguish between different acoustic environments and is a technology which can be used by smart devices for contextualization and personalization. Standard algorithms exploit hand-crafted features which are unlikely to offer the best potential for reliable classification. This paper reports the first application of convolutional neural networks (CNNs) to ASC, an approach which learns discriminant features automatically from spectral representations of raw acoustic data. A principal influence on performance comes from the specific convolutional filters which can be adjusted to capture different spectrotemporal, recurrent acoustic structure. The proposed CNN approach is shown to outperform a Gaussian mixture model baseline for the DCASE 2016 database even though training data is sparse.

[1]  Karol J. Piczak Environmental sound classification with convolutional neural networks , 2015, 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP).

[2]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[3]  Colin Raffel,et al.  librosa: 0.4.1 , 2015 .

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Aurélien Mayoue,et al.  Deep neural networks for audio scene recognition , 2015, 2015 23rd European Signal Processing Conference (EUSIPCO).

[6]  Alain Rakotomamonjy,et al.  Histogram of gradients of Time-Frequency Representations for Audio scene detection , 2015 .

[7]  Mark D. Plumbley,et al.  Acoustic Scene Classification: Classifying environments from the sounds they produce , 2014, IEEE Signal Processing Magazine.

[8]  Nitish Srivastava,et al.  Improving Neural Networks with Dropout , 2013 .

[9]  Björn Schuller,et al.  RECOGNISING ACOUSTIC SCENES WITH LARGE-SCALE AUDIO FEATURE EXTRACTION AND SVM , 2013 .

[10]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[11]  R. M. Schafer,et al.  The Soundscape: Our Sonic Environment and the Tuning of the World , 1993 .

[12]  Nicholas W. D. Evans,et al.  Acoustic context recognition using local binary pattern codebooks , 2015, 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[13]  P. Herrera,et al.  RECURRENCE QUANTIFICATION ANALYSIS FEATURES FOR AUDITORY SCENE CLASSIFICATION , 2013 .

[14]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[15]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Jürgen Schmidhuber,et al.  Deep learning in neural networks: An overview , 2014, Neural Networks.

[17]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[18]  Mounya Elhilali,et al.  MULTIRESOLUTION AUDITORY REPRESENTATIONS FOR SCENE CLASS IFICATION , 2013 .

[19]  Tuomas Virtanen,et al.  TUT database for acoustic scene classification and sound event detection , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[20]  Sebastian Böck,et al.  Improved musical onset detection with Convolutional Neural Networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).