HODGEPODGE: Sound Event Detection Based on Ensemble of Semi-Supervised Learning Methods

In this paper, we present a method called HODGEPODGE\footnotemark[1] for large-scale detection of sound events using the weakly labeled, synthetic, and unlabeled data provided in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge Task 4: Sound event detection in domestic environments. For this task, we adopted a convolutional recurrent neural network (CRNN) as our backbone network. Because only a small amount of labeled data is available alongside a large amount of unlabeled in-domain data, we focus primarily on how to apply semi-supervised learning methods efficiently to make full use of the limited data. Three semi-supervised learning principles are used in our system: 1) consistency regularization applied to data augmentation; 2) a MixUp regularizer requiring that the prediction for an interpolation of two inputs be close to the interpolation of the predictions for the individual inputs; and 3) MixUp regularization applied to the interpolation between data augmentations. We also ensembled models trained with these different semi-supervised learning principles. Our proposed approach significantly improves on the baseline, achieving an event-based F-measure of 42.0\% on the official evaluation dataset, compared with the baseline's 25.8\%. Our submission ranked third among 18 teams in Task 4.
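As a brief illustration of the second and third principles, the following is a minimal sketch of the standard MixUp interpolation and of an interpolation-consistency term of the kind described above. The notation is generic and assumed here for illustration only: labeled pairs $(x_i, y_i)$, unlabeled clips $u_i$, a student model $f_\theta$, a target model $f_{\theta'}$ (for example an exponential-moving-average "mean teacher" of the student, as is common in this line of work), and a Beta-distribution parameter $\alpha$; it is not the exact formulation of our implementation.

% MixUp interpolation of inputs and targets
\[
\lambda \sim \mathrm{Beta}(\alpha, \alpha), \qquad
\tilde{x} = \lambda\, x_i + (1-\lambda)\, x_j, \qquad
\tilde{y} = \lambda\, y_i + (1-\lambda)\, y_j
\]
% Interpolation-consistency term on unlabeled clips
\[
\mathcal{L}_{\mathrm{cons}} =
\Big\| f_\theta\big(\lambda\, u_i + (1-\lambda)\, u_j\big)
- \big(\lambda\, f_{\theta'}(u_i) + (1-\lambda)\, f_{\theta'}(u_j)\big) \Big\|^2
\]

In this sketch, the labeled examples are trained on $(\tilde{x}, \tilde{y})$ with the usual classification loss, while $\mathcal{L}_{\mathrm{cons}}$ pushes the student's prediction on a mixed unlabeled input toward the mixture of the target model's predictions on the individual inputs.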
