Adaptive Multi-Scale Detection of Acoustic Events

The goal of acoustic (or sound) event detection (AED or SED) is to predict the temporal position of target events in given audio segments. This task plays a significant role in safety monitoring, acoustic early warning, and other scenarios. However, the scarcity of data and the diversity of acoustic event sources make AED a challenging task, especially for the prevalent data-driven methods. In this article, we start by analyzing acoustic events according to their time-frequency domain properties, showing that different acoustic events have different time-frequency scale characteristics. Inspired by this analysis, we propose an adaptive multi-scale detection (AdaMD) method. By taking advantage of an hourglass neural network and a gated recurrent unit (GRU) module, AdaMD produces multiple predictions at different temporal and frequency resolutions. An adaptive training algorithm is subsequently adopted to combine the multi-scale predictions and enhance the overall capability. Experimental results on Detection and Classification of Acoustic Scenes and Events 2017 (DCASE 2017) Task 2, DCASE 2016 Task 3, and DCASE 2017 Task 3 demonstrate that AdaMD outperforms published state-of-the-art competitors in terms of event error rate (ER) and F1-score. A verification experiment on our collected factory mechanical dataset also demonstrates the noise robustness of AdaMD, suggesting that it can be deployed in complex environments.
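The abstract reports results as event error rate (ER) and F1-score without defining them. As a rough illustration only, the sketch below computes the segment-based form of both metrics as they are commonly defined for polyphonic sound event detection: per segment, substitutions are min(FP, FN), deletions are max(0, FN - FP), insertions are max(0, FP - FN), and ER is their total divided by the number of active reference events. The function name `segment_metrics` and the toy activity matrices are hypothetical, and the abstract does not state whether each benchmark uses the segment-based or the event-based variant, so this is not necessarily the evaluation code behind the reported numbers.

```python
from typing import List, Tuple


def segment_metrics(
    reference: List[List[int]], predicted: List[List[int]]
) -> Tuple[float, float]:
    """Return (error_rate, f1) for binary activity matrices.

    Both inputs are indexed as [segment][event_class], with 1 meaning the
    class is active in that segment. This layout is an assumption made for
    the sketch, not something specified in the abstract.
    """
    tp = fp = fn = 0
    substitutions = deletions = insertions = 0
    n_reference = 0  # total number of active reference events

    for ref_seg, pred_seg in zip(reference, predicted):
        seg_tp = sum(1 for r, p in zip(ref_seg, pred_seg) if r and p)
        seg_fp = sum(1 for r, p in zip(ref_seg, pred_seg) if p and not r)
        seg_fn = sum(1 for r, p in zip(ref_seg, pred_seg) if r and not p)

        tp += seg_tp
        fp += seg_fp
        fn += seg_fn

        # A false positive paired with a false negative in the same segment
        # counts as one substitution; the remainder is a deletion or insertion.
        substitutions += min(seg_fp, seg_fn)
        deletions += max(0, seg_fn - seg_fp)
        insertions += max(0, seg_fp - seg_fn)
        n_reference += sum(1 for r in ref_seg if r)

    total_errors = substitutions + deletions + insertions
    error_rate = total_errors / n_reference if n_reference else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return error_rate, f1


if __name__ == "__main__":
    # Toy example: 4 segments, 2 event classes (values are made up).
    reference = [[1, 0], [1, 1], [0, 0], [0, 1]]
    predicted = [[1, 0], [0, 1], [1, 0], [1, 0]]
    er, f1 = segment_metrics(reference, predicted)
    print(f"ER = {er:.2f}, F1 = {f1:.2f}")
```

In the toy example the four segments contribute one correct segment, one deletion, one insertion, and one substitution, which shows why ER can stay high even when F1 looks moderate: ER is normalized by the number of reference events and can exceed 1, whereas F1 is bounded between 0 and 1.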
