Forward-Backward Convolutional Recurrent Neural Networks and Tag-Conditioned Convolutional Neural Networks for Weakly Labeled Semi-supervised Sound Event Detection

In this paper we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge Task 4: Sound event detection and separation in domestic environments. We introduce two new models: the forward-backward convolutional recurrent neural network (FBCRNN) and the tag-conditioned convolutional neural network (CNN). The FBCRNN employs two recurrent neural network (RNN) classifiers sharing the same CNN for preprocessing. With one RNN processing a recording in the forward direction and the other in the backward direction, the two networks are trained to jointly predict audio tags, i.e., weak labels, at each time step within a recording, given that at each time step they have jointly processed the whole recording. The proposed training encourages the classifiers to tag events as soon as possible. Therefore, after training, the networks can be applied to shorter audio segments of, e.g., 200 ms, allowing sound event detection (SED). Further, we propose a tag-conditioned CNN to complement SED. It is trained to predict strong labels while using (predicted) tags, i.e., weak labels, as additional input. For training, pseudo strong labels from an FBCRNN ensemble are used. The presented system placed fourth and third in the systems and teams rankings, respectively. Subsequent improvements allow our system to even outperform the challenge baseline and winner systems on average by, respectively, 18.0 % and 2.2 % event-based F1-score on the validation set. Source code is publicly available at https://github.com/fgnt/pb_sed.
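The joint forward-backward tagging described above can be illustrated with a minimal sketch. It is not taken from the pb_sed code base: the function names, the element-wise maximum used to combine the two classifiers, and the binary cross-entropy loss are assumptions chosen for illustration. The sketch assumes that each classifier outputs a tag probability per time step for a single event class, and that at each boundary between two frames the forward output (frames up to the boundary) and the backward output (frames after it) together cover the whole recording.

```python
import math


def joint_tag_scores(fwd, bwd):
    """Combine forward and backward tag probabilities for one event class.

    fwd[t]: probability after the forward RNN has seen frames 0..t.
    bwd[t]: probability after the backward RNN has seen frames t..T-1.
    The pair (fwd[t], bwd[t+1]) has jointly seen the whole recording.
    Taking the maximum is an assumed combination rule, not the paper's spec.
    """
    if len(fwd) != len(bwd) or len(fwd) < 2:
        raise ValueError("fwd and bwd need the same length of at least 2")
    return [max(f, b) for f, b in zip(fwd[:-1], bwd[1:])]


def weak_label_loss(fwd, bwd, tag, eps=1e-7):
    """Mean binary cross-entropy of every joint score against the clip tag.

    Because the clip-level tag is the target at every time step, each
    classifier is pushed to report an event as soon as it has seen it.
    """
    joint = joint_tag_scores(fwd, bwd)
    total = 0.0
    for p in joint:
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= tag * math.log(p) + (1 - tag) * math.log(1.0 - p)
    return total / len(joint)


# Hypothetical probabilities for a 4-frame recording with the event active.
fwd = [0.1, 0.2, 0.9, 0.9]
bwd = [0.9, 0.9, 0.8, 0.1]
joint = joint_tag_scores(fwd, bwd)
loss_if_present = weak_label_loss(fwd, bwd, tag=1)
loss_if_absent = weak_label_loss(fwd, bwd, tag=0)
```

In this toy example every joint score is high, so the loss is small when the clip-level tag is 1 and large when it is 0.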
