Enabling Early Audio Event Detection with Neural Networks

This paper presents a methodology for early detection of audio events in audio streams. Early detection is the ability to infer an ongoing event during its initial stage. The proposed system couples a novel inference step with two parallel deep neural networks (DNNs), each trained with a tailored loss. The two DNNs share a similar architecture but differ in their loss functions, a weighted loss and a multitask loss, which are designed to cope efficiently with issues common to audio event detection. The inference step uses the network outputs to recognize ongoing events. Monotonicity of the detection function, which is required for reliable early detection, is also proved. Experiments on the ITC-Irst database show that the proposed system achieves state-of-the-art detection performance. Furthermore, observing only part of an event is sufficient to achieve performance similar to that obtained when the entire event is observed, which enables early event detection.
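To illustrate why a monotone detection function matters for early detection, the following is a minimal sketch and not the paper's actual inference step. It assumes, purely for illustration, that each frame contributes a non-negative evidence value and that the detection score is the running sum of that evidence; the names `running_detection_score` and `first_detection_frame` are hypothetical. Under that assumption the score never decreases as more of the event is observed, so a decision made once a threshold is crossed is not revoked by later frames.

```python
from itertools import accumulate


def running_detection_score(frame_evidence):
    """Return the running sum of per-frame evidence.

    Assumption (not from the paper): evidence values are non-negative,
    so the returned sequence is non-decreasing by construction.
    """
    evidence = list(frame_evidence)
    if any(e < 0 for e in evidence):
        raise ValueError("frame evidence must be non-negative")
    return list(accumulate(evidence))


def first_detection_frame(frame_evidence, threshold):
    """Return the index of the first frame whose running score reaches
    the threshold, or None if the threshold is never reached."""
    for index, score in enumerate(running_detection_score(frame_evidence)):
        if score >= threshold:
            return index
    return None


# Hypothetical per-frame evidence for one event (illustrative values only).
example_evidence = [0.1, 0.2, 0.9, 0.8, 0.7]
example_scores = running_detection_score(example_evidence)
example_hit = first_detection_frame(example_evidence, threshold=1.0)
```

In this toy setting the threshold is crossed partway through the event, which is the behaviour the abstract describes as detecting from a partial observation.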
