SubSpectralNet – Using Sub-spectrogram Based Convolutional Neural Networks for Acoustic Scene Classification

Acoustic Scene Classification (ASC) is one of the core research problems in the field of Computational Sound Scene Analysis. In this work, we present SubSpectralNet, a novel model which captures discriminative features by incorporating frequency band-level differences to model soundscapes. Using mel-spectrograms, we propose the idea of using band-wise crops of the input time-frequency representations and train a convolutional neural network (CNN) on the same. We also propose a modification in the training method for more efficient learning of the CNN models. We first give a motivation for using sub-spectrograms by giving intuitive and statistical analyses and finally we develop a sub-spectrogram based CNN architecture for ASC. The system is evaluated on the public ASC development dataset provided for the "Detection and Classification of Acoustic Scenes and Events" (DCASE) 2018 Challenge. Our best model achieves an improvement of +14% in terms of classification accuracy with respect to the DCASE 2018 baseline system. Code and figures are available at https://github.com/ssrp/SubSpectralNet

[1]  Abhinav Dhall,et al.  Multi-level Dense Capsule Networks , 2018, ACCV.

[2]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Kyogu Lee,et al.  Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification , 2017, DCASE.

[4]  François Pachet,et al.  The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music. , 2007, The Journal of the Acoustical Society of America.

[5]  Karol J. Piczak The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification , 2017, DCASE.

[6]  Roger Zimmermann,et al.  Learning and Fusing Multimodal Deep Features for Acoustic Scene Categorization , 2018, ACM Multimedia.

[7]  Mathieu Lagrange,et al.  The bag-of-frames approach: a not so sufficient model for urban soundscapes , 2014, The Journal of the Acoustical Society of America.

[8]  Karol J. Piczak Environmental sound classification with convolutional neural networks , 2015, 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP).

[9]  Tuomas Virtanen,et al.  A multi-device dataset for urban acoustic scene classification , 2018, DCASE.

[10]  Gaël Richard,et al.  Acoustic scene classification with matrix factorization for unsupervised feature learning , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Huaiyu Zhu On Information and Sufficiency , 1997 .

[12]  VirtanenTuomas,et al.  Detection and Classification of Acoustic Scenes and Events , 2018 .

[13]  S. Squartini,et al.  DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks , 2016, DCASE.

[14]  Gaël Richard,et al.  Feature Learning With Matrix Factorization Applied to Acoustic Scene Classification , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15]  R. Beran Minimum Hellinger distance estimates for parametric models , 1977 .

[16]  Gaël Richard,et al.  HOG and subband power distribution image features for acoustic scene classification , 2015, 2015 23rd European Signal Processing Conference (EUSIPCO).

[17]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Alain Rakotomamonjy,et al.  Histogram of gradients of Time-Frequency Representations for Audio scene detection , 2015, ArXiv.

[19]  Tobias Watzka,et al.  Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) , 2018 .

[20]  Mathieu Lagrange,et al.  Saliency-based modeling of acoustic scenes using sparse non-negative matrix factorization , 2013, 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS).

[21]  Mark D. Plumbley,et al.  Computational Analysis of Sound Scenes and Events , 2017 .

[22]  Vesa T. Peltonen,et al.  Audio-based context recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Mathieu Lagrange,et al.  Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24]  Huy Phan,et al.  Label Tree Embeddings for Acoustic Scene Classification , 2016, ACM Multimedia.