Joint Weakly Supervised AT and AED Using Deep Feature Distillation and Adaptive Focal Loss

A good joint training framework can improve the performance of weakly supervised audio tagging (AT) and acoustic event detection (AED) simultaneously. In this study, we propose three methods to improve the best teacher-student framework of DCASE2019 Task 4 for both the AT and AED tasks. We first propose a frame-level, target-event-based deep feature distillation, which aims to exploit the limited strong-labeled data within the weakly supervised framework to learn better intermediate feature maps. We then propose an adaptive focal loss and a two-stage training strategy that enable effective and more accurate model training, in which the contributions of difficult-to-classify and easy-to-classify acoustic events to the total cost function are adjusted automatically. Furthermore, an event-specific post-processing is designed to improve the prediction of target-event time stamps. Our experiments are performed on the public DCASE2019 Task 4 dataset, and the results show that our approach achieves competitive performance on both the AT (49.8% F1-score) and AED (81.2% F1-score) tasks.
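To make the adaptive focal loss idea concrete, below is a minimal sketch of a focal-style binary cross-entropy for multi-label audio tagging. It assumes a sigmoid multi-label setup; the function name, the `class_difficulty` argument, and the specific rule for adapting the focusing exponent are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_focal_bce(logits, targets, gamma=2.0, alpha=0.25, class_difficulty=None):
    """
    Focal-style binary cross-entropy for multi-label audio tagging (sketch).

    logits:  (batch, num_events) raw model outputs
    targets: (batch, num_events) multi-hot ground-truth labels
    class_difficulty: optional (num_events,) tensor in [0, 1]; harder classes
        receive a larger effective focusing factor (an illustrative stand-in
        for the adaptive weighting described in the abstract).
    """
    probs = torch.sigmoid(logits)
    # p_t: probability the model assigns to the true label of each event
    p_t = targets * probs + (1.0 - targets) * (1.0 - probs)

    # Optionally adapt the focusing exponent per class; otherwise use fixed gamma.
    if class_difficulty is not None:
        gamma = gamma * (1.0 + class_difficulty)  # broadcasts over the batch

    # Standard alpha balancing between positive and negative labels.
    alpha_t = targets * alpha + (1.0 - targets) * (1.0 - alpha)

    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Down-weight easy examples (p_t close to 1) and emphasize hard ones.
    loss = alpha_t * (1.0 - p_t) ** gamma * bce
    return loss.mean()
```

In a two-stage setup, such a loss could replace the plain binary cross-entropy in the second stage, once the model's per-class confidence gives a usable notion of which events are easy or hard; the exact schedule and adaptation rule would follow the paper's method rather than this sketch.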
