Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data

Sound event detection (SED) aims to detect when sound events occur in an audio clip and to recognize what they are. Many supervised SED algorithms rely on strongly labelled data, which contain onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled: only the presence of sound events is known, not their onset and offset times. In this paper, we propose a time–frequency (T–F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied to a T–F representation of an audio clip, such as a log mel spectrogram, to obtain T–F segmentation masks of sound events. These masks can be used to separate the sound events from the background scenes in the T–F domain. A classification mapping is then applied to the T–F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping with a convolutional neural network and the classification mapping with global weighted rank pooling. For SED, predicted onset and offset times can be obtained from the T–F segmentation masks; as a byproduct, separated waveforms of sound events can also be obtained from them. For evaluation, we remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound event data. When mixed at 0 dB, the proposed method achieved F1 scores of 0.534, 0.398, and 0.167 in audio tagging, frame-wise SED, and event-wise SED, respectively, outperforming the fully connected deep neural network baseline scores of 0.331, 0.237, and 0.120. In T–F segmentation, we achieved an F1 score of 0.218, whereas previous methods were not able to perform T–F segmentation.
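As a rough illustration of the classification mapping and of how onset and offset times might be read off a mask, the following NumPy sketch may help. It is not the authors' implementation. The abstract does not give the details, so the sketch assumes one common formulation of global weighted rank pooling: mask values are sorted in descending order and averaged with geometrically decaying weights. The names `global_weighted_rank_pooling` and `mask_to_events`, the decay parameter `r`, the (time, frequency) mask layout, the frequency-averaged frame score, the threshold, and the hop size are all assumptions made for this example.

```python
import numpy as np


def global_weighted_rank_pooling(mask, r=0.5):
    """Pool one T-F segmentation mask (values in [0, 1]) into a scalar.

    Assumed formulation: sort the mask values in descending order and take
    a weighted average with weights r**0, r**1, r**2, ...
    r = 1 reduces to global average pooling, r = 0 to global max pooling.
    """
    values = np.sort(np.asarray(mask, dtype=float).ravel())[::-1]
    weights = r ** np.arange(values.size)
    return float(np.sum(weights * values) / np.sum(weights))


def mask_to_events(mask, threshold=0.5, hop_seconds=0.02):
    """Turn a (time, frequency) mask into a list of (onset, offset) pairs.

    Assumption: a frame is active when its mean mask value over frequency
    exceeds `threshold`. Offsets are exclusive (first inactive frame).
    """
    frame_scores = np.asarray(mask, dtype=float).mean(axis=1)
    active = frame_scores > threshold
    # Pad with inactive frames so that every event has both edges.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    onsets, offsets = changes[::2], changes[1::2]
    return [(float(on * hop_seconds), float(off * hop_seconds))
            for on, off in zip(onsets, offsets)]


if __name__ == "__main__":
    # Toy mask for one sound class: 6 frames, 4 frequency bins,
    # with the event occupying frames 2 and 3.
    toy_mask = np.zeros((6, 4))
    toy_mask[2:4, :] = 1.0
    print(global_weighted_rank_pooling(toy_mask, r=0.9))
    print(mask_to_events(toy_mask, threshold=0.5, hop_seconds=0.5))
```

In this sketch, the decay parameter interpolates between max pooling, which responds to a single strong T-F bin, and average pooling, which expects the event to fill the whole mask. That trade-off is one plausible reason to prefer a rank-weighted pooling when only clip-level labels are available.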
