Sound event detection with weakly labelled data
Sound event detection (SED) is the task of detecting the onset and offset times of sound events in an audio recording. SED has many applications in both academia and industry, such as multimedia information retrieval and the monitoring of domestic and public security. However, compared with speech signal processing, which has been researched for many years, the classification and detection of general sounds received little attention until recent years.
One limitation of research on audio classification and sound event detection is that few datasets were publicly available until the release of the Detection and Classification of Acoustic Scenes and Events (DCASE) dataset. The DCASE dataset consists of data for acoustic scene classification (ASC), audio tagging (AT) and sound event detection. ASC and AT are tasks to design systems that predict pre-defined labels for an audio clip. SED is a task to design systems that predict both the presence or absence of sound events in an audio clip and the onset and offset times of those events.
One difficulty of the audio classification and SED task is that many datasets
such as the DCASE dataset are weakly labelled. That is, only the presence or
absence of sound events in an audio clip is known, without knowing the onset and
offset annotations of the sound events. This thesis focused on solving the audio
tagging and sound event detection problem using only weakly labelled data. This
thesis proposed attention neural networks to solve the general weakly labelled AT
and SED problem. The attention neural networks can automatically learn to attend
to important segments and ignore silence and irrelevant segments in an audio clip.
We developed a set of weak learning methods for AT and SED using attention neural
networks. The proposed methods have achieved state-of-the-art performance
in audio tagging and sound event detection.
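
The idea of attending to important segments can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes one common formulation in which a network emits, for each segment and a single sound class, a classification score and an attention score, and the clip-level prediction is the attention-weighted average of the segment-level probabilities. The function name `attention_pooling`, the scores, and the 0.5 threshold are all hypothetical.

```python
import math


def sigmoid(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Normalise scores over segments so the weights sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pooling(cla_scores, att_scores):
    """Pool per-segment scores for ONE sound class into a clip-level probability.

    cla_scores: per-segment classification scores (assumed network output).
    att_scores: per-segment attention scores (assumed network output).
    """
    seg_probs = [sigmoid(s) for s in cla_scores]  # segment-level presence
    att_weights = softmax(att_scores)             # how much each segment counts
    clip_prob = sum(a * p for a, p in zip(att_weights, seg_probs))
    return clip_prob, att_weights, seg_probs


# Hypothetical scores for a five-segment clip in which the event
# occupies segments 2 and 3; the other segments are silence or noise.
cla_scores = [-4.0, -3.0, 5.0, 4.0, -4.0]
att_scores = [-2.0, -2.0, 3.0, 3.0, -2.0]

clip_prob, att_weights, seg_probs = attention_pooling(cla_scores, att_scores)

# Audio tagging uses the clip-level probability; a rough sound event
# detection output can be read from the segment-level probabilities.
detected = [p > 0.5 for p in seg_probs]
```

Only the clip-level probability needs to be compared with the weak label during training, which is why no onset or offset annotations are required in this sketch; the segment-level probabilities fall out as a by-product and can be thresholded to estimate onset and offset times.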