A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling

This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-known framework using GAP to localize visual objects given image-level labels. While most of the previous works on weakly supervised AED used recurrent layers with attention-based mechanism to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017, our method significantly outperforms the state-of-the-art method in F1 score by 5.3% on the dev set, and 6.0% on the eval set in terms of absolute values. For weakly supervised AED task in DCASE 2018, our model outperforms the state-of-the-art method in event-based F1 by 8.1% on the dev set, and 0.5% on the eval set in terms of absolute values, by using data augmentation and tri-training to leverage unlabeled data.

[1]  Nicolas Turpault,et al.  Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments , 2018, DCASE.

[2]  Li Ying,et al.  Fully Convolutional DenseNet based polyphonic sound event detection , 2018, 2018 International Conference on Cloud Computing, Big Data and Blockchain (ICCBB).

[3]  Kyogu Lee,et al.  Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input , 2017, DCASE.

[4]  Daniel P. W. Ellis,et al.  General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline , 2018, DCASE.

[5]  Ankit Shah,et al.  DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System , 2017, DCASE.

[6]  Anurag Kumar,et al.  SeCoST:: Sequential Co-Supervision for Large Scale Weakly Labeled Audio Event Detection , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Zhi-Hua Zhou,et al.  Tri-training: exploiting unlabeled data using three classifiers , 2005, IEEE Transactions on Knowledge and Data Engineering.

[8]  Chao Wang,et al.  A simple model for detection of rare sound events , 2018, INTERSPEECH.

[9]  Ian McLoughlin,et al.  A Region Based Attention Method for Weakly Supervised Sound Event Detection and Classification , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Anurag Kumar,et al.  Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Florian Metze,et al.  A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Yong Xu,et al.  Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[14]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Ming Sun,et al.  A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Yong Xu,et al.  DCASE 2018 Challenge baseline with convolutional neural networks , 2018, ArXiv.

[17]  Wei-Qiang Zhang,et al.  Learning How to Listen: A Temporal-Frequential Attention Model for Sound Event Detection , 2018, INTERSPEECH.

[18]  Chao Wang,et al.  R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection , 2018, INTERSPEECH.

[19]  Wenhao Ding,et al.  Multi-Scale Time-Frequency Attention for Acoustic Event Detection , 2019, INTERSPEECH.

[20]  Kyogu Lee,et al.  Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks , 2017, DCASE.

[21]  Kai Yu,et al.  Duration Robust Weakly Supervised Sound Event Detection , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Yong Xu,et al.  Capsule Routing for Sound Event Detection , 2018, 2018 26th European Signal Processing Conference (EUSIPCO).

[26]  Wei-Qiang Zhang,et al.  Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection , 2019, INTERSPEECH.

[27]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Annamaria Mesaros,et al.  Metrics for Polyphonic Sound Event Detection , 2016 .

[29]  Spyridon Matsoukas,et al.  Semi-supervised Acoustic Event Detection Based on Tri-training , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Lu Jiakai,et al.  MEAN TEACHER CONVOLUTION SYSTEM FOR DCASE 2018 TASK 4 , 2018 .