Crowdsourcing Strong Labels for Sound Event Detection

Strong labels are necessary for evaluating sound event detection methods, but they are often scarce because annotating precise temporal boundaries demands considerable time and effort. We present a method for estimating strong labels from crowdsourced weak labels, using a process that divides the annotation work into simple unit tasks. The weak labels are aggregated and processed based on estimates of the annotators' competence, resulting in a set of objective strong labels. The experiment uses synthetic audio so that the quality of the resulting annotations can be verified against the ground truth. The proposed method produces labels with high precision, although not all event instances are recalled. Detection metrics comparing the produced annotations with the ground truth show an F-score of 80% in 1 s segments, and up to 89.5% intersection-based F1-score calculated according to the polyphonic sound detection score (PSDS) metrics.
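To make the aggregation step concrete, here is a minimal sketch of one plausible reading of it: binary presence/absence votes collected per fixed-length segment are combined using per-annotator competence weights (e.g., MACE-style estimates), and consecutive active segments are merged into strong labels with onsets and offsets. The function name, the 0.5 decision threshold, and the weighting scheme are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def aggregate_strong_labels(votes, competence, seg_len=1.0, threshold=0.5):
    """Competence-weighted aggregation of segment-wise weak labels.

    votes:      (n_annotators, n_segments) binary array for one sound class,
                where 1 means the annotator marked the class as active.
    competence: (n_annotators,) non-negative weights, e.g. MACE estimates.
    Returns a list of (onset, offset) strong labels in seconds.
    """
    votes = np.asarray(votes, dtype=float)
    w = np.asarray(competence, dtype=float)

    # Weighted fraction of annotators that marked each segment as active.
    support = (w[:, None] * votes).sum(axis=0) / w.sum()
    active = support >= threshold

    # Merge runs of consecutive active segments into (onset, offset) events.
    events, onset = [], None
    for i, is_active in enumerate(active):
        if is_active and onset is None:
            onset = i * seg_len
        elif not is_active and onset is not None:
            events.append((onset, i * seg_len))
            onset = None
    if onset is not None:
        events.append((onset, len(active) * seg_len))
    return events

# Three annotators voting on five 1 s segments of one recording:
votes = [[1, 1, 0, 0, 1],
         [1, 0, 0, 1, 1],
         [0, 1, 0, 0, 1]]
print(aggregate_strong_labels(votes, competence=[0.9, 0.6, 0.3]))
# -> [(0.0, 2.0), (4.0, 5.0)]
```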
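The 1 s segment-based F-score used in the evaluation can be computed with standard toolkits such as sed_eval; the self-contained sketch below reproduces the idea for a single class by rasterizing reference and estimated events onto a fixed 1 s grid. Snapping onsets and offsets outward to segment boundaries is an assumption made here for brevity.

```python
import numpy as np

def segment_f1(reference, estimated, duration, seg_len=1.0):
    """Single-class segment-based F1 over a fixed time grid.

    reference, estimated: lists of (onset, offset) tuples in seconds.
    duration: total length of the audio in seconds.
    """
    n_segments = int(np.ceil(duration / seg_len))

    def rasterize(events):
        # Mark every segment that an event overlaps as active.
        grid = np.zeros(n_segments, dtype=bool)
        for onset, offset in events:
            start = int(np.floor(onset / seg_len))
            stop = int(np.ceil(offset / seg_len))
            grid[start:stop] = True
        return grid

    ref, est = rasterize(reference), rasterize(estimated)
    tp = np.sum(ref & est)
    fp = np.sum(~ref & est)
    fn = np.sum(ref & ~est)
    # Both empty means perfect agreement on silence.
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0

print(segment_f1([(0.0, 2.0), (4.0, 5.0)], [(0.0, 1.5), (4.2, 5.0)], duration=5.0))
# -> 1.0, since both estimates overlap the same 1 s segments
```

The intersection-based F1-score from the PSDS framework is computed differently: rather than using a fixed grid, it validates each detection by the fraction of its duration that intersects the reference (and vice versa), which makes it more tolerant of small boundary deviations.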
