The goal of sound event localization and detection (SELD) is to detect the presence of polyphonic sound events and, at the same time, to localize the sources of those events. In this paper, we propose a complete pipeline for the SELD task, consisting of data augmentation, network prediction, and post-processing stages. In the data augmentation stage, we expand the official dataset with SpecAugment [1]. In the network prediction stage, we train the event detection network and the localization network separately, and use the event predictions to output localization predictions only for active frames. In the post-processing stage, we propose a prior knowledge-based regularization (PKR), which calculates the average of the localization predictions over each event segment and replaces the predictions for that segment with this average. We prove theoretically that this technique can reduce the mean squared error (MSE). Evaluated on the DCASE 2019 Challenge Task 3 Development Dataset, our system achieves approximately a 59% reduction in SED error rate (ER) and a 13% reduction in direction-of-arrival (DOA) error relative to the baseline system (on the Ambisonic dataset).
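The averaging step of PKR can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pkr_smooth`, the array layout (frames by DOA coordinates), and the definition of an event segment as a contiguous run of active frames for one event class are all assumptions made for illustration. It also averages angles naively, so it would need adjusting for azimuths that wrap around at ±180°.

```python
import numpy as np


def pkr_smooth(doa, active):
    """Average DOA predictions over each contiguous active segment.

    doa    : (T, D) frame-wise localization predictions for one event class
             (hypothetical layout, e.g. D = 2 for azimuth and elevation).
    active : (T,) boolean event-activity predictions for the same class.
    Returns a copy of `doa` in which every active segment is replaced by
    its segment mean; inactive frames are left unchanged.
    """
    doa = np.asarray(doa, dtype=float)
    active = np.asarray(active, dtype=bool)
    out = doa.copy()

    # Pad with False on both sides so every segment has a start and an end edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]  # end indices are exclusive

    for s, e in zip(starts, ends):
        out[s:e] = doa[s:e].mean(axis=0)
    return out


# Toy example: two segments separated by one inactive frame.
doa = np.array([[10, 0], [12, 2], [14, 4], [100, 50], [30, 6], [34, 10]])
active = np.array([True, True, True, False, True, True])
smoothed = pkr_smooth(doa, active)
# Frames 0-2 become [12, 2], frame 3 is untouched, frames 4-5 become [32, 8].
```

If the true direction is constant within a segment, the squared error of the segment mean is never larger than the average squared error of the individual frames (by Jensen's inequality), which is consistent with the MSE claim in the abstract.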
[1] Taras Butko, et al. Two-source acoustic event detection and localization: Online implementation in a Smart-room. 2011. 2011 19th European Signal Processing Conference.
[2] Toni Hirvonen, et al. Classification of Spatial Audio Location and Content Using Convolutional Neural Networks. 2015.
[3] Archontis Politis, et al. Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network. 2017. 2018 26th European Signal Processing Conference (EUSIPCO).
[4] Quoc V. Le, et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. 2019. INTERSPEECH.