Deep Feature Embedding and Hierarchical Classification for Audio Scene Classification

In this work, we propose an approach to Acoustic Scene Classification (ASC) that combines deep feature embedding learning with hierarchical classification using a triplet loss function. On the one hand, a deep convolutional neural network is first trained to learn a feature embedding from scene audio signals. The trained network maps an input into the embedding space, transforming it into a high-level feature vector that serves as its representation. On the other hand, to exploit the structure of the scene categories, the original scene classification problem is organized into a hierarchy in which similar categories are grouped into meta-categories. Hierarchical classification is then performed by deep neural network classifiers combined with the triplet loss function. Our experiments show that the proposed system performs well on both the DCASE 2018 Task 1A and 1B datasets, with absolute accuracy gains of 15.6% and 16.6% over the DCASE 2018 baseline on Task 1A and 1B, respectively.
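The two ideas named in the abstract, a triplet loss and a two-level category hierarchy, can be illustrated with a minimal sketch. It is not the system described here: the squared-Euclidean hinge form of the loss follows the common FaceNet-style formulation, and the margin value, the `META_CATEGORIES` grouping, the scene names, and the function names are all assumptions chosen only for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed FaceNet-style form).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin` (a hypothetical default value).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical grouping of scene classes into meta-categories; the grouping
# used in the paper is not given in the abstract.
META_CATEGORIES = {
    "indoor": ["airport", "shopping_mall", "metro_station"],
    "outdoor": ["park", "public_square", "street_traffic"],
    "vehicle": ["bus", "metro", "tram"],
}


def hierarchical_predict(meta_scores, scene_scores):
    """Pick the best meta-category first, then the best scene inside it."""
    meta = max(meta_scores, key=meta_scores.get)
    candidates = META_CATEGORIES[meta]
    scene = max(candidates, key=lambda s: scene_scores.get(s, float("-inf")))
    return meta, scene
```

In this sketch the second stage only considers scenes belonging to the chosen meta-category, so a high score for a scene outside that group cannot override the first-stage decision. That is one simple way to use a hierarchy and may differ from how the classifiers in this work are combined.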
