Semi-Supervised NMF-CNN for Sound Event Detection

This paper proposes a combined approach that uses Nonnegative Matrix Factorization (NMF) and a Convolutional Neural Network (CNN) for Sound Event Detection (SED) in audio clips. NMF is first used to approximate strong labels for the weakly labeled data. Using the resulting approximately strongly labeled data, two different CNNs are then trained in a semi-supervised framework, one for clip-level prediction and the other for frame-level prediction. With this approach, our model achieves an event-based F1-score of 45.7% on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge Task 4 validation dataset. Ensembling models by averaging their posterior outputs raises the event-based F1-score to 48.6%. Our proposed models outperform the baseline model by over 8%. On the DCASE 2020 Challenge Task 4 test set, our models achieve an event-based F1-score of 44.4%, while our ensembled system achieves 46.3%. These results exceed the baseline system by at least 7%, which demonstrates the robustness of the proposed method across datasets.
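The two mechanisms named in the abstract, approximating strong labels with NMF and ensembling by averaging posteriors, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how NMF activations are turned into frame labels, so the rule used here (thresholding the per-frame energy of the NMF reconstruction relative to its maximum) is an assumption, as are the function names, the rank, the thresholds, and the use of plain multiplicative updates.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a nonnegative matrix V (freq x frames) as W @ H using
    multiplicative updates for the squared Frobenius error."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def approximate_strong_labels(V, rank=4, rel_threshold=0.5):
    """Assumed labeling rule (not taken from the paper): mark a frame as
    active when the energy of its NMF reconstruction exceeds a fraction
    of the maximum frame energy. The clip's weak label would then be
    assigned to the active frames."""
    W, H = nmf(V, rank)
    activity = W.sum(axis=0) @ H  # per-frame sum over frequency of W @ H
    return activity > rel_threshold * activity.max()


def ensemble_posteriors(posteriors, threshold=0.5):
    """Average the posterior outputs of several models and binarize.
    `posteriors` has shape (n_models, ...); the 0.5 threshold is an
    assumption."""
    mean = np.mean(posteriors, axis=0)
    return mean, mean > threshold


# Toy spectrogram: low background energy with one event in frames 20-39.
V = np.full((16, 60), 0.01)
V[4:8, 20:40] += 1.0
frame_mask = approximate_strong_labels(V)

# Two hypothetical models' posteriors for one clip and two event classes.
mean_post, decisions = ensemble_posteriors(
    np.array([[[0.9, 0.2]], [[0.5, 0.4]]])
)
```

In the toy example, `frame_mask` marks the frames covered by the synthetic event, and `decisions` reflects the averaged posterior of the two hypothetical models rather than either model's output alone.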
