Multi-Scale Time-Frequency Attention for Acoustic Event Detection

Most attention-based methods attend only along the time axis, which is insufficient for Acoustic Event Detection (AED). Moreover, previous AED methods have rarely considered that target events have distinct temporal and frequency scales. In this work, we propose a Multi-Scale Time-Frequency Attention (MTFA) module for AED. MTFA gathers information at multiple resolutions to generate a time-frequency attention mask that tells the model where to focus along both the time and frequency axes. With MTFA, the model can capture the characteristics of target events at different scales. We evaluate the proposed method on Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge. Our method achieves competitive results on both the development and evaluation datasets.
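
To make the idea of a multi-scale time-frequency mask concrete, the following is a minimal sketch and not the MTFA module itself, whose architecture is not described in this abstract. It assumes a toy, parameter-free stand-in: the input is a time-by-frequency feature map, each "resolution" is obtained by block averaging, and the results are fused by a plain mean followed by a sigmoid. The function names, the scales (1, 2, 4), and the fusion rule are all illustrative assumptions; a trained model would presumably use learned layers in their place.

```python
import math


def avg_pool_upsample(x, scale):
    """Average x over scale-by-scale blocks, then copy each block mean back
    to every cell of that block (nearest-neighbour upsampling)."""
    n_t, n_f = len(x), len(x[0])
    out = [[0.0] * n_f for _ in range(n_t)]
    for t0 in range(0, n_t, scale):
        for f0 in range(0, n_f, scale):
            ts = range(t0, min(t0 + scale, n_t))  # frames in this block
            fs = range(f0, min(f0 + scale, n_f))  # frequency bins in this block
            mean = sum(x[t][f] for t in ts for f in fs) / (len(ts) * len(fs))
            for t in ts:
                for f in fs:
                    out[t][f] = mean
    return out


def multi_scale_tf_mask(x, scales=(1, 2, 4)):
    """Hypothetical mask: mean of the multi-resolution views, squashed by a
    sigmoid so every time-frequency cell gets a weight in (0, 1)."""
    n_t, n_f = len(x), len(x[0])
    views = [avg_pool_upsample(x, s) for s in scales]
    mask = [[0.0] * n_f for _ in range(n_t)]
    for t in range(n_t):
        for f in range(n_f):
            fused = sum(v[t][f] for v in views) / len(scales)
            mask[t][f] = 1.0 / (1.0 + math.exp(-fused))
    return mask


def apply_mask(x, mask):
    """Re-weight the features element-wise along both time and frequency."""
    return [[xv * mv for xv, mv in zip(xr, mr)] for xr, mr in zip(x, mask)]


# Toy 4-frame x 4-bin feature map with a single burst of energy.
features = [[0.0] * 4 for _ in range(4)]
features[1][2] = 4.0
mask = multi_scale_tf_mask(features)
attended = apply_mask(features, mask)
```

In this toy input, the cell containing the burst receives the largest weight, its neighbours inside the same coarse block receive a smaller one, and distant cells stay close to 0.5. This shows how coarse and fine views jointly shape a single mask over both axes.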
