CNNs-based Acoustic Scene Classification using Multi-Spectrogram Fusion and Label Expansions

Spectrograms such as the STFT spectrogram and the MFCC spectrogram have been widely used in Convolutional Neural Network (CNN) based schemes for acoustic scene classification. Their time-frequency characteristics differ, so each has its own advantages and disadvantages in recognizing acoustic scenes. In this letter, a novel multi-spectrogram fusion framework is proposed so that the spectrograms complement each other. In the framework, a single CNN architecture is applied to multiple spectrograms for feature extraction. The deep features extracted from the multiple spectrograms are then fused to discriminate the acoustic scenes. Moreover, motivated by the inter-class similarities in acoustic scene datasets, a label expansion method is further proposed in which super-class labels are constructed upon the original classes. With the help of the expanded labels, the CNN models are transformed into a multitask learning form that improves acoustic scene classification by appending the auxiliary task of super-class classification. To verify the effectiveness of the proposed methods, extensive experiments have been performed on the DCASE2017 and the LITIS Rouen datasets. Experimental results show that the proposed method achieves promising accuracies on both datasets. Specifically, accuracies of 0.9744, 0.8865 and 0.7778 are obtained for the LITIS Rouen dataset, the DCASE Development set and the DCASE Evaluation set, respectively.
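The two ideas in the abstract, fusing deep features from several spectrograms and adding a super-class task, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state how features are fused, so concatenation is assumed here; random vectors stand in for the CNN features; and the scene-to-super-class mapping `scene_to_super`, the weight `aux_weight`, and all function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max subtraction for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true labels.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def fuse_features(feature_list):
    # ASSUMPTION: fusion by concatenation along the feature axis.
    return np.concatenate(feature_list, axis=1)

def multitask_loss(fused, w_scene, w_super, scene_labels,
                   scene_to_super, aux_weight=0.5):
    # Main task: original scene classes. Auxiliary task: super-classes.
    # Super-class labels are derived from the scene labels via the mapping.
    super_labels = scene_to_super[scene_labels]
    scene_loss = cross_entropy(softmax(fused @ w_scene), scene_labels)
    super_loss = cross_entropy(softmax(fused @ w_super), super_labels)
    return scene_loss + aux_weight * super_loss

rng = np.random.default_rng(0)
n_clips, feat_dim = 8, 16
n_scene, n_super = 6, 3

# Stand-ins for deep features a shared CNN would extract per spectrogram.
stft_feat = rng.normal(size=(n_clips, feat_dim))
cqt_feat = rng.normal(size=(n_clips, feat_dim))
fused = fuse_features([stft_feat, cqt_feat])

# HYPOTHETICAL grouping: scene classes 0-1, 2-3, 4-5 share a super-class.
scene_to_super = np.array([0, 0, 1, 1, 2, 2])
scene_labels = rng.integers(0, n_scene, size=n_clips)

# Linear classification heads on top of the fused features.
w_scene = rng.normal(scale=0.1, size=(fused.shape[1], n_scene))
w_super = rng.normal(scale=0.1, size=(fused.shape[1], n_super))

loss = multitask_loss(fused, w_scene, w_super, scene_labels, scene_to_super)
print(fused.shape, loss)
```

Setting `aux_weight=0` reduces the objective to ordinary single-task scene classification, which makes the contribution of the auxiliary super-class term easy to isolate in an ablation.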
