Time-Frequency Feature Decomposition Based on Sound Duration for Acoustic Scene Classification

Acoustic scene classification is the task of identifying the type of acoustic environment in which a given audio signal was recorded. The signal is a mixture of sound events with various characteristics. Focused, in-depth analysis is needed to identify the sound patterns that are most representative for recognizing and differentiating the scenes. In this paper, we propose a feature decomposition method based on temporal median filtering, and use convolutional neural networks to model long-duration background sounds and transient sounds separately. Experiments on log-mel and wavelet-based time-frequency features show that the proposed method leads to better classification accuracy. Analysis of the detailed experimental results reveals that (1) long-duration sounds are generally the most informative for acoustic scene classification; and (2) the sound duration that matters most may differ between types of acoustic scenes.
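As a rough illustration of the decomposition idea, the sketch below applies a running median along the time axis of a small time-frequency matrix. It is not the paper's implementation. It makes two assumptions that the abstract does not state: the long-duration component is taken to be the median-filtered output, and the transient component is taken to be the residual after subtracting it. The function names, the kernel size, and the truncated-window edge handling are hypothetical choices made for the example.

```python
from statistics import median


def temporal_median_filter(row, kernel):
    """Running median over time for one frequency bin.

    The window is truncated at the edges (an assumption of this sketch).
    """
    if kernel < 1 or kernel % 2 == 0:
        raise ValueError("kernel must be a positive odd integer")
    half = kernel // 2
    n = len(row)
    return [median(row[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]


def decompose(spec, kernel=5):
    """Split a time-frequency matrix into long-duration and transient parts.

    spec: list of frequency bins, each a list of values over time frames.
    Returns (background, transient), where transient = spec - background
    (the residual definition is an assumption, not taken from the paper).
    """
    background = [temporal_median_filter(row, kernel) for row in spec]
    transient = [
        [x - b for x, b in zip(row, bg)]
        for row, bg in zip(spec, background)
    ]
    return background, transient


# Toy input: bin 0 holds a steady level with one short burst, bin 1 is constant.
spec = [
    [1.0, 1.0, 9.0, 1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
]
background, transient = decompose(spec, kernel=3)
print(background[0])  # steady level, burst suppressed
print(transient[0])   # burst isolated in the residual
```

A longer kernel pushes more short events into the residual, so the kernel length controls what counts as "long-duration" in this sketch. In a setup like the one described above, the two outputs would be passed to separate convolutional networks.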
