A Multi-Resolution Approach to Sound Event Detection in DCASE 2020 Task 4

In this paper, we propose a multi-resolution analysis for feature extraction in Sound Event Detection. Because different acoustic events exhibit distinct temporal and spectral characteristics, we hypothesize that different time-frequency resolutions are better suited to locating each sound category. We carry out our experiments on the DESED dataset in the context of the DCASE 2020 Task 4 challenge, where the combination of up to five different time-frequency resolutions via model fusion outperforms the baseline results. In addition, we propose class-specific thresholds for the F1-score metric, further improving the results on the Validation and Public Evaluation sets.
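The pipeline described above can be sketched in a few lines: extract spectrograms at several time-frequency resolutions, fuse the per-resolution class probabilities, and binarize with a per-class threshold rather than a single global one. The window/hop sizes, the averaging fusion rule, and the threshold values below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Hypothetical set of five time-frequency resolutions (illustrative
# values; the paper's exact analysis parameters may differ).
RESOLUTIONS = [
    {"n_fft": 512,  "hop": 128},
    {"n_fft": 1024, "hop": 256},
    {"n_fft": 2048, "hop": 512},
    {"n_fft": 4096, "hop": 1024},
    {"n_fft": 8192, "hop": 2048},
]

def stft_magnitude(y, n_fft, hop):
    """Magnitude spectrogram of signal y at one resolution.

    Longer windows give finer frequency resolution; shorter windows
    give finer time resolution -- the trade-off the paper exploits.
    """
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(window * y[i:i + n_fft]))
        for i in range(0, len(y) - n_fft + 1, hop)
    ]
    return np.stack(frames, axis=1)  # shape: (freq_bins, time_frames)

def fuse_predictions(prob_list):
    """Late fusion: average the (time, classes) probability matrices
    produced by the per-resolution models (a common fusion scheme,
    assumed here for illustration)."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def apply_class_thresholds(probs, thresholds):
    """Binarize fused (time, classes) probabilities with a
    class-specific threshold instead of a single global 0.5."""
    return probs >= np.asarray(thresholds)[None, :]
```

For example, fusing two models' frame probabilities `[[0.2, 0.8]]` and `[[0.4, 0.6]]` yields `[[0.3, 0.7]]`, which per-class thresholds of `[0.5, 0.5]` binarize to `[[False, True]]`.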
