HODGEPODGE: Sound Event Detection Based on Ensemble of Semi-Supervised Learning Methods

In this paper, we present a method called HODGEPODGE\footnotemark[1] for large-scale detection of sound events using the weakly labeled, synthetic, and unlabeled data provided in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge Task 4: Sound event detection in domestic environments. For this task, we adopted a convolutional recurrent neural network (CRNN) as our backbone network. Because only a small amount of labeled data is available alongside a large amount of unlabeled in-domain data, we focus primarily on how to apply semi-supervised learning methods efficiently to make full use of the limited data. Three semi-supervised learning principles are used in our system: 1) consistency regularization applied to data augmentation; 2) a MixUp regularizer requiring that the prediction for an interpolation of two inputs be close to the interpolation of the predictions for the individual inputs; and 3) MixUp regularization applied to the interpolation between data augmentations. We also ensembled models trained with these different semi-supervised learning principles. Our proposed approach significantly improves on the baseline, achieving an event-based F-measure of 42.0\% on the official evaluation dataset, compared with the baseline's 25.8\%. Our submission ranked third among 18 teams in Task 4.
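As a brief illustration of the second and third principles, the following is a minimal sketch of the standard MixUp interpolation and of an interpolation-consistency term of the kind described above. The notation is generic and assumed here for illustration only: labeled pairs $(x_i, y_i)$, unlabeled clips $u_i$, a student model $f_\theta$, a target model $f_{\theta'}$ (for example an exponential-moving-average "mean teacher" of the student, as is common in this line of work), and a Beta-distribution parameter $\alpha$; it is not the exact formulation of our implementation.

% MixUp interpolation of inputs and targets
\[
\lambda \sim \mathrm{Beta}(\alpha, \alpha), \qquad
\tilde{x} = \lambda\, x_i + (1-\lambda)\, x_j, \qquad
\tilde{y} = \lambda\, y_i + (1-\lambda)\, y_j
\]
% Interpolation-consistency term on unlabeled clips
\[
\mathcal{L}_{\mathrm{cons}} =
\Big\| f_\theta\big(\lambda\, u_i + (1-\lambda)\, u_j\big)
- \big(\lambda\, f_{\theta'}(u_i) + (1-\lambda)\, f_{\theta'}(u_j)\big) \Big\|^2
\]

In this sketch, the labeled examples are trained on $(\tilde{x}, \tilde{y})$ with the usual classification loss, while $\mathcal{L}_{\mathrm{cons}}$ pushes the student's prediction on a mixed unlabeled input toward the mixture of the target model's predictions on the individual inputs.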
