Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification

In this paper, we propose a domain adaptation framework that addresses the device mismatch issue in acoustic scene classification by leveraging neural label embedding (NLE) and relational teacher-student learning (RTSL). The framework captures the structural relationships between acoustic scene classes, which are intrinsically device-independent. In the training stage, transferable knowledge from the source domain is condensed into the NLE. In the adaptation stage, a novel RTSL strategy then learns adapted target models without the paired source-target data often required in conventional teacher-student learning. The proposed framework is evaluated on the DCASE 2018 Task1b data set. Experimental results based on AlexNet-L deep classification models confirm the effectiveness of the proposed approach under device mismatch. NLE-alone adaptation compares favourably with conventional device adaptation and teacher-student-based adaptation techniques, and NLE with RTSL further improves the classification accuracy.
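
The two stages described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes that each label embedding is simply the mean source-model output of its class, and that the relational term matches normalized pairwise distances under a Huber penalty, in the style of relational knowledge distillation. All function names (`class_embeddings`, `pairwise_distances`, `relational_loss`) are hypothetical.

```python
import numpy as np

def class_embeddings(outputs, labels, num_classes):
    """Assumed stand-in for NLE: average the source-model outputs of each class."""
    return np.stack([outputs[labels == c].mean(axis=0) for c in range(num_classes)])

def pairwise_distances(x):
    """Euclidean distance matrix, normalized by its mean non-zero entry."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    mean = d[d > 0].mean() if np.any(d > 0) else 1.0
    return d / mean

def huber(a, b, delta=1.0):
    """Mean Huber penalty between two arrays of the same shape."""
    err = np.abs(a - b)
    return np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta)).mean()

def relational_loss(student_out, labels, embeddings):
    """Compare the pairwise structure of target-model outputs with that of the
    label embeddings selected by the target labels. The teacher signal comes
    from the embeddings alone, so no paired source-target audio is involved."""
    teacher = embeddings[labels]
    return huber(pairwise_distances(student_out), pairwise_distances(teacher))

# Toy usage with random data standing in for source-model outputs.
rng = np.random.default_rng(0)
source_outputs = rng.normal(size=(30, 4))
source_labels = np.arange(30) % 3
embeddings = class_embeddings(source_outputs, source_labels, num_classes=3)

target_labels = np.array([0, 1, 2, 1, 0])
target_outputs = embeddings[target_labels] + 0.1 * rng.normal(size=(5, 4))
print(relational_loss(target_outputs, target_labels, embeddings))
```

Because the distances are normalized, this loss depends only on the relative geometry of the outputs. In a sketch like this it would be combined with a point-wise term that pulls each target output towards its label embedding.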
