Sound Event Detection Using Point-Labeled Data

Sound Event Detection (SED) in audio scenes is a task that has been studied by an increasing number of researchers. Recent SED systems often use deep learning models. Building these systems typically requires a large amount of carefully annotated, strongly labeled data, where the exact time span of each sound event is indicated (e.g., a ‘dog bark’ starting at 1.2 seconds and ending at 2.0 seconds in a recording of a city park). However, manually labeling sound events with their time boundaries within a recording is very time-consuming. One way to address this issue is to collect data with weak labels, which contain only the names of the sound classes present in an audio file, without time boundary information for the events in the file. As a result, weakly labeled sound event detection has become popular recently. However, there is still a large performance gap between models built on weakly labeled data and those built on strongly labeled data, especially for predicting the time boundaries of sound events. In this work, we introduce a new type of sound event label, which is easier for people to provide than strong labels. We call these ‘point labels’. To create a point label, a user simply listens to the recording and hits the space bar upon hearing a sound event (e.g., a ‘dog bark’). This is much easier than specifying exact time boundaries. In this work, we illustrate methods to train a SED model on point-labeled data. Our results show that a model trained on point-labeled audio data significantly outperforms models trained on weakly labeled data and is comparable to a model trained on strongly labeled data.
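To make the idea concrete, below is a minimal sketch (not the paper's actual training procedure) of one way point labels might be converted into frame-level targets for a standard SED model: since a key press marks only a single instant within an event, each annotated timestamp is expanded into a fixed window of positive frames. All function names, parameter names, and the window size are illustrative assumptions.

```python
import numpy as np

def point_labels_to_frame_targets(points, num_frames, frame_hop_s,
                                  num_classes, window_s=1.0):
    """Expand point labels into a frame-level target matrix.

    points:       list of (timestamp_seconds, class_index) pairs, one per
                  key press made while listening to the recording
    num_frames:   number of analysis frames in the recording
    frame_hop_s:  hop between consecutive frames, in seconds
    window_s:     assumed event duration centered on each point label
                  (a guess, since the key press marks only one instant)
    """
    targets = np.zeros((num_frames, num_classes), dtype=np.float32)
    half = window_s / 2.0
    for t, c in points:
        start = max(0, int((t - half) / frame_hop_s))
        end = min(num_frames, int((t + half) / frame_hop_s) + 1)
        targets[start:end, c] = 1.0  # mark frames near the point as positive
    return targets

# Example: two key presses in a 10-second clip with 50 ms frame hops
targets = point_labels_to_frame_targets(
    points=[(1.6, 0), (7.3, 2)],  # 'dog bark' at 1.6 s, another class at 7.3 s
    num_frames=200, frame_hop_s=0.05, num_classes=3)
```

These frame-level targets could then be used with a standard frame-wise loss, analogous to training on strong labels; the window size is the key hyperparameter this sketch leaves open.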
