Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification

Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded. The log-mel feature and the convolutional neural network (CNN) have recently become the most popular time-frequency (TF) representation and classifier in ASC. An audio signal recorded in a scene may include various sounds that overlap in time and frequency. A previous study suggests that treating long-duration and short-duration sounds separately in the CNN may improve ASC accuracy. This study addresses the generalization ability of acoustic scene classifiers. In practice, the characteristics of acoustic scene signals may be affected by various factors, such as the choice of recording device and changes of recording location. When an established ASC system predicts scene classes on audio recorded under unseen conditions, its accuracy may drop significantly. Long-duration sounds contain not only domain-independent acoustic scene information but also channel information determined by the recording conditions, which is prone to over-fitting. For a more robust ASC system, we propose a robust feature learning (RFL) framework to train the CNN. The RFL framework down-weights CNN learning specifically on long-duration sounds. Concretely, an auxiliary classifier is trained with only long-duration sound information as input, using an auxiliary loss function that assigns less learning weight to poorly classified examples than the standard cross-entropy loss. Experimental results show that the proposed RFL framework yields an acoustic scene classifier that is more robust to unseen devices and cities.
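
The auxiliary loss described above can be read as the opposite of focal loss: focal loss multiplies the cross-entropy term by (1 - p_t)^gamma to emphasize hard examples, whereas scaling by p_t^gamma gives poorly classified examples (small true-class probability p_t) less learning weight. Below is a minimal PyTorch sketch of this weighting, together with one common way to obtain a long-duration component via median-filtering harmonic/percussive source separation in librosa. The function name `down_weighted_loss`, the exponent `gamma`, the exact p_t^gamma form, and the file path "scene.wav" are illustrative assumptions, not the paper's verified formulation.

```python
import numpy as np
import torch
import torch.nn.functional as F
import librosa


def down_weighted_loss(logits, targets, gamma=2.0):
    """Auxiliary loss: cross-entropy scaled by p_t ** gamma (an assumed form).

    Focal loss uses (1 - p_t) ** gamma to emphasize hard examples; using
    p_t ** gamma instead assigns LESS weight to poorly classified examples
    (small p_t), matching the behaviour the RFL framework asks of the
    auxiliary classifier trained on long-duration sounds.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    p_t = log_p_t.exp()
    # Detach the weight so it only rescales gradients of the CE term.
    return (p_t.detach() ** gamma * -log_p_t).mean()


# Example usage on a random batch of 8 clips over 10 scene classes.
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = down_weighted_loss(logits, targets)

# One way to isolate long-duration (harmonic-like) sounds: median-filtering
# HPSS on the STFT, then a log-mel representation of the harmonic part for
# the auxiliary classifier. "scene.wav" is a placeholder path.
y, sr = librosa.load("scene.wav", sr=None)
H, P = librosa.decompose.hpss(librosa.stft(y))
logmel_long = librosa.power_to_db(
    librosa.feature.melspectrogram(S=np.abs(H) ** 2, sr=sr)
)
```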
