Two-Stage Domain Adaptation for Sound Event Detection

Sound event detection in real-world scenarios is a challenging task. Because of the large distribution mismatch between synthetic and real audio data, the performance of a sound event detection model trained on strongly labeled synthetic data degrades dramatically when it is applied in real environments. To tackle this issue and improve the robustness of sound event detection models, we propose a two-stage domain adaptation approach for sound event detection. The backbone convolutional recurrent neural network (CRNN), learned using strongly labeled synthetic data, is updated by weak-label supervised adaptation and frame-level adversarial domain adaptation. As a result, the parameters of the CRNN are updated for real audio data, and the input-space distribution mismatch between synthetic and real audio data is mitigated in the feature space of the CRNN. Moreover, a context clip-level consistency regularization between the classification outputs of the CNN and the CRNN is introduced to improve the feature representation ability of the convolutional layers in the CRNN. Experiments on the DCASE 2019 sound event detection in domestic environments task demonstrate the superiority of the proposed domain adaptation approach. Our approach achieves F1 scores of 48.3% on the validation set and 49.4% on the evaluation set, which are state-of-the-art sound event detection results for a CRNN model without data augmentation.
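The two loss terms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the pooling function, the distance used for the consistency term, or the discriminator design. The sketch assumes mean pooling over frames, a mean squared error between the clip-level outputs of the two branches, and a binary cross-entropy objective for a frame-level domain discriminator. All function names are hypothetical.

```python
import math


def clip_level_probs(frame_probs):
    """Pool frame-level class probabilities into clip-level ones.

    Assumption: simple mean pooling over time. `frame_probs` is a list of
    frames, each a list of per-class probabilities.
    """
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    return [
        sum(frame[c] for frame in frame_probs) / n_frames
        for c in range(n_classes)
    ]


def consistency_loss(cnn_clip_probs, crnn_clip_probs):
    """Clip-level consistency term between the CNN and CRNN outputs.

    Assumption: mean squared error; the source only says the two
    classification outputs are regularized to be consistent.
    """
    n_classes = len(cnn_clip_probs)
    squared = sum((a - b) ** 2 for a, b in zip(cnn_clip_probs, crnn_clip_probs))
    return squared / n_classes


def domain_bce(frame_domain_probs, is_real):
    """Frame-level discriminator loss (assumed binary cross-entropy).

    `frame_domain_probs` holds, per frame, the discriminator's probability
    that the frame comes from real audio rather than synthetic audio.
    """
    eps = 1e-7  # clamp to avoid log(0)
    target = 1.0 if is_real else 0.0
    total = 0.0
    for p in frame_domain_probs:
        p = min(max(p, eps), 1.0 - eps)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(frame_domain_probs)


# Toy example: two frames, two event classes.
cnn_frames = [[0.2, 0.8], [0.4, 0.6]]
crnn_frames = [[0.1, 0.9], [0.3, 0.7]]
l_consistency = consistency_loss(
    clip_level_probs(cnn_frames), clip_level_probs(crnn_frames)
)
l_domain = domain_bce([0.5, 0.5], is_real=True)
```

In an adversarial setup the feature extractor would be trained to increase the discriminator's loss while the discriminator minimizes it; that alternating update is omitted here because the abstract gives no details about it.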
