Sound event detection with weakly labelled data

Sound event detection (SED) is the task of detecting the onset and offset times of sound events in an audio recording. SED has many applications in both academia and industry, such as multimedia information retrieval and the monitoring of domestic and public security. However, compared with speech signal processing, which has been studied for many years, the classification and detection of general sounds received little research attention until recent years. One limitation on research in audio classification and sound event detection was the scarcity of publicly available datasets before the release of the Detection and Classification of Acoustic Scenes and Events (DCASE) dataset. The DCASE dataset consists of data for acoustic scene classification (ASC), audio tagging (AT) and sound event detection. ASC and AT are tasks in which systems predict pre-defined labels for an audio clip. SED is a task in which systems predict both the presence or absence of sound events in an audio clip and the onset and offset times of those events.

One difficulty of audio classification and SED is that many datasets, including the DCASE dataset, are weakly labelled: only the presence or absence of sound events in an audio clip is known, without onset and offset annotations. This thesis focuses on solving the audio tagging and sound event detection problem using only weakly labelled data. It proposes attention neural networks for the general weakly labelled AT and SED problem. The attention neural networks automatically learn to attend to important segments and to ignore silent and irrelevant segments in an audio clip. We develop a set of weak learning methods for AT and SED using attention neural networks. The proposed methods achieve state-of-the-art performance in audio tagging and sound event detection.
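The idea of attending to important segments and down-weighting silence can be illustrated with attention-based pooling: segment-wise predictions are aggregated into a clip-level prediction using learned attention weights, so that training needs only clip-level (weak) labels. The following is a minimal NumPy sketch of this pooling step, not the exact architecture of the thesis; the function and variable names are hypothetical.

```python
import numpy as np

def attention_pooling(seg_logits, att_logits):
    """Aggregate segment-wise predictions into a clip-level prediction.

    seg_logits: (T, C) array of segment-wise classification logits
                (T segments, C sound event classes).
    att_logits: (T, C) array of segment-wise attention logits.
    Returns a (C,) array of clip-level event probabilities.
    """
    # Segment-wise class probabilities (sigmoid, since audio tagging
    # is a multi-label problem: several events may co-occur).
    p = 1.0 / (1.0 + np.exp(-seg_logits))
    # Attention weights normalised over the time axis; silent or
    # irrelevant segments can thus receive weight close to zero.
    w = np.exp(att_logits)
    w = w / w.sum(axis=0, keepdims=True)
    # Clip-level probability: attention-weighted average over segments.
    return (w * p).sum(axis=0)
```

With uniform attention this reduces to plain average pooling over segments; when the attention logits strongly favour one segment, the clip-level prediction follows that segment's probability, which is how the model can localise events despite seeing only weak labels.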