Teacher-student Training for Acoustic Event Detection Using Audioset

This paper studies Acoustic Event Detection (AED) systems and the problem of customising them rapidly and easily to arbitrary deployment scenarios. Annotating AED data is time-consuming and error-prone, largely because event time-stamps are often unclear, so most of the available large-scale AED datasets are released with weak clip-level labels, which in turn affects how weakly-supervised training procedures should be designed. We investigate a teacher-student training approach for learning low-complexity student models from large teachers. We first show that state-of-the-art performance can be achieved by a Convolutional Neural Network (CNN) model with an appropriate attention mechanism. We then describe a framework that enables learning arbitrary small-footprint AED systems, either generic or domain-expert, from generic teachers. We carry out experiments on Audioset, a large-scale weakly labelled dataset of acoustic events.
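The two ingredients named in the abstract, attention over frames to obtain clip-level predictions and teacher-student training, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the softmax form of the attention pooling, the interpolation weight `alpha`, and the use of binary cross-entropy against teacher posteriors are all assumptions chosen for illustration.

```python
import math


def binary_cross_entropy(target, prob, eps=1e-7):
    """Cross-entropy between a (possibly soft) target in [0, 1] and a probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def attention_pool(frame_probs, attention_scores):
    """Pool frame-level probabilities of one class into a clip-level probability.

    Assumption: attention weights are a softmax over per-frame scores, so the
    clip-level output is a weighted average of the frame-level outputs.
    """
    peak = max(attention_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in attention_scores]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, frame_probs)) / total


def student_loss(student_probs, teacher_probs, weak_labels, alpha=0.5):
    """Multi-label student loss, averaged over classes.

    Assumption: the loss interpolates between cross-entropy against the weak
    clip-level labels (weight 1 - alpha) and cross-entropy against the
    teacher's clip-level posteriors (weight alpha).
    """
    n = len(student_probs)
    hard = sum(binary_cross_entropy(y, p)
               for y, p in zip(weak_labels, student_probs)) / n
    soft = sum(binary_cross_entropy(t, p)
               for t, p in zip(teacher_probs, student_probs)) / n
    return (1.0 - alpha) * hard + alpha * soft


if __name__ == "__main__":
    # One class, three frames: the second frame gets most of the attention.
    clip_prob = attention_pool([0.1, 0.9, 0.2], [0.0, 2.0, 0.0])
    # Two classes: hypothetical student and teacher clip-level outputs.
    loss = student_loss([clip_prob, 0.2], [0.8, 0.1], [1.0, 0.0], alpha=0.5)
    print(clip_prob, loss)
```

With `alpha=0` the student is trained on the weak labels alone, and with `alpha=1` it only mimics the teacher; a domain-expert student could, under the same assumptions, restrict the sum to the subset of classes relevant to its deployment scenario.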
