Time Aggregation Operators for Multi-label Audio Event Detection

In this paper, we present a state-of-the-art system for audio event detection. The labels on the training (and evaluation) data specify the set of events occurring in each audio clip, but neither their time spans nor the order in which they occur. Specifically, we address the weakly supervised learning task of the “Detection and Classification of Acoustic Scenes and Events (DCASE) 2017” challenge [5]. We take the winning entry in this challenge, by Xu et al. [10], as our starting point and identify several important modifications that allow us to improve on their results significantly. Our techniques pertain to aggregating and consolidating signals over time and frequency across a (temporal) sequence before decoding the labels. More generally, our work is also relevant to other tasks that involve learning from weak labels on sequential data.
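
To make the idea of aggregating over time before decoding labels concrete, the following is a minimal sketch and not the system described in this paper. It assumes a hypothetical model that has already produced per-frame event probabilities of shape (T, C), and it collapses them into one clip-level score per event class using three generic operators: max, mean, and a weighted average. The function name `aggregate_over_time`, its parameters, and the toy numbers are all illustrative assumptions.

```python
import numpy as np


def aggregate_over_time(frame_probs, weights=None, mode="max"):
    """Collapse frame-level probabilities (T, C) into clip-level scores (C,).

    Illustrative only: these are generic pooling operators, not the
    specific operators proposed in the paper.
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    if mode == "max":
        # A class is as likely as its most confident frame.
        return frame_probs.max(axis=0)
    if mode == "mean":
        # Every frame contributes equally.
        return frame_probs.mean(axis=0)
    if mode == "weighted":
        if weights is None:
            raise ValueError("mode='weighted' requires per-frame weights")
        w = np.asarray(weights, dtype=float)
        # Normalize the weights over the time axis, separately per class.
        w = w / w.sum(axis=0, keepdims=True)
        return (w * frame_probs).sum(axis=0)
    raise ValueError(f"unknown mode: {mode}")


# Toy clip: 4 frames, 2 event classes (made-up numbers).
frame_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.1],
    [0.1, 0.3],
    [0.0, 0.1],
])

clip_max = aggregate_over_time(frame_probs, mode="max")
clip_mean = aggregate_over_time(frame_probs, mode="mean")
clip_weighted = aggregate_over_time(
    frame_probs, weights=np.ones_like(frame_probs), mode="weighted"
)

# Decode the multi-label prediction with a hypothetical threshold of 0.5.
predicted_labels = clip_max >= 0.5
```

In this toy clip, max pooling flags the first class because a single frame is confident, whereas mean pooling dilutes that evidence across the silent frames. That difference is the kind of trade-off a time aggregation operator has to resolve when only clip-level labels are available.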

[1] Ji Feng, et al. Deep MIML Network, 2017, AAAI.

[2] Bin Yang, et al. Multi-level attention model for weakly supervised audio classification, 2018, DCASE.

[3] Geoffrey E. Hinton, et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[4] Tuomas Virtanen, et al. Sound event detection using spatial features and convolutional recurrent neural network, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Yann Dauphin, et al. Language Modeling with Gated Convolutional Networks, 2016, ICML.

[6] Ankit Shah, et al. DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System, 2017, DCASE.

[7] Yong Xu, et al. Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Florian Metze, et al. Multiple Instance Deep Learning for Weakly Supervised Audio Event Detection, 2017, ArXiv.

[10] Hwee Tou Ng, et al. A Neural Approach to Automated Essay Scoring, 2016, EMNLP.

[11] Heikki Huttunen, et al. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection, 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.