Acoustic scene classification using multi-layer temporal pooling based on convolutional neural network

The performance of an Acoustic Scene Classification (ASC) system depends heavily on the latent temporal dynamics of the audio signal. In this paper, we propose a multi-layer temporal pooling method that takes a CNN feature sequence as input and can effectively capture the temporal dynamics of an entire audio signal of arbitrary duration by building direct connections between the sequence and its time indexes. We applied the framework to DCASE 2018 Task 1 (ASC). For evaluation, we trained a Support Vector Machine (SVM) on the features learned by the proposed Multi-Layered Temporal Pooling (MLTP). Experimental results on the development dataset show that using the MLTP features significantly improved ASC performance. The best performance, 75.28% accuracy, was achieved with the optimal setting found in our experiments.
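
The idea of connecting a feature sequence to its time indexes can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a ridge regression (swapped in here for simplicity) that maps each smoothed frame to its time index, and it uses the learned weight vector as the pooled feature. The function names, window size, stride, and regularization value are hypothetical choices made only for illustration.

```python
import numpy as np


def temporal_pool(seq, reg=1.0):
    """Pool a (T, D) sequence into one (D,) vector.

    Assumption: a ridge regression from smoothed frames to their time
    indexes stands in for whatever ranking/regression model is actually used.
    """
    seq = np.asarray(seq, dtype=float)
    T, D = seq.shape
    t = np.arange(1, T + 1, dtype=float)
    # Time-varying mean: frame i becomes the mean of frames 1..i.
    smoothed = np.cumsum(seq, axis=0) / t[:, None]
    # Closed-form ridge solution: w = (X^T X + reg * I)^-1 X^T t
    A = smoothed.T @ smoothed + reg * np.eye(D)
    return np.linalg.solve(A, smoothed.T @ t)


def multi_layer_temporal_pool(seq, window=8, stride=4, layers=2):
    """Stack pooling layers: each intermediate layer pools sliding windows
    into a shorter sequence; the last layer pools the whole sequence."""
    seq = np.asarray(seq, dtype=float)
    for _ in range(layers - 1):
        T = seq.shape[0]
        if T <= window:
            break  # too short for another windowed layer
        starts = range(0, T - window + 1, stride)
        seq = np.stack([temporal_pool(seq[s:s + window]) for s in starts])
    return temporal_pool(seq)


# Hypothetical CNN feature sequence: 50 frames, 16 dimensions each.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 16))
pooled = multi_layer_temporal_pool(features)
print(pooled.shape)
```

Because the output dimension depends only on the feature dimension and not on the number of frames, clips of different durations yield fixed-length vectors that could then be passed to a classifier such as an SVM.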
