Acoustic Scene Classification for Mismatched Recording Devices Using Heated-Up Softmax and Spectrum Correction

Deep neural networks (DNNs) are successful in applications where the inference and training distributions match. In real-world scenarios, DNNs have to cope with truly new data samples during inference, potentially coming from a shifted data distribution. This usually causes a drop in performance. Acoustic scene classification (ASC) with different recording devices is one such situation. Furthermore, an imbalance in the quality and amount of data recorded by different devices poses severe challenges. In this paper, we introduce two calibration methods to tackle these challenges. In particular, we apply scaling of the features to deal with the varying frequency responses of the recording devices. Furthermore, to account for the shifted data distribution, a heated-up softmax is embedded to calibrate the predictions of the model. We use robust and resource-efficient models, and show the efficiency of heated-up softmax. Our ASC system reaches state-of-the-art performance on the development set of the DCASE 2019 challenge, task 1B, with only ~70K parameters. It achieves 70.1% average classification accuracy for device B and device C. It performs on par with the best single-model system of the DCASE 2019 challenge and outperforms the baseline system by 28.7% (absolute).
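The two calibration ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that feature scaling means one multiplicative coefficient per frequency bin, estimated from time-aligned recordings of a reference device and a secondary device, and that the heated-up softmax divides the logits by a temperature before normalization. The function names, array shapes, and the `eps` and `temperature` parameters are hypothetical.

```python
import numpy as np


def correction_coefficients(ref_spec, dev_spec, eps=1e-8):
    """Estimate one scaling coefficient per frequency bin (assumed scheme).

    ref_spec, dev_spec: magnitude spectrograms of time-aligned recordings,
    shape (n_pairs, n_freq, n_frames), from the reference and secondary device.
    """
    ref_mean = ref_spec.mean(axis=2)            # (n_pairs, n_freq)
    dev_mean = dev_spec.mean(axis=2)            # (n_pairs, n_freq)
    ratios = ref_mean / (dev_mean + eps)        # per-pair frequency-response ratio
    return ratios.mean(axis=0)                  # (n_freq,)


def apply_correction(spec, coeffs):
    """Scale each frequency bin of a (n_freq, n_frames) spectrogram."""
    return spec * coeffs[:, None]


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis with logits divided by a temperature.

    A temperature above 1 flattens the output distribution; a temperature
    below 1 sharpens it. The predicted class (argmax) is unchanged.
    """
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

In this sketch the coefficients would be estimated once from the paired recordings and then applied to every spectrogram from the secondary device before it is passed to the classifier, while the temperature would be chosen on held-out data.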
