MTF-CRNN: Multiscale Time-Frequency Convolutional Recurrent Neural Network for Sound Event Detection

We propose a multiscale time-frequency convolutional recurrent neural network (MTF-CRNN) for sound event detection. The goal is to improve detection performance while keeping the parameter count low, and to recognize target sound events of variable duration against different acoustic backgrounds. The network uses four groups of parallel and serial convolutional kernels to learn high-level shift-invariant features from the time and frequency domains of acoustic samples. A two-layer bidirectional gated recurrent unit then captures the temporal context of the extracted features. We evaluate the method on two sound event datasets. As a single model with a low parameter count and no pretraining, it substantially outperforms the baseline and other methods. On the TUT Rare Sound Events 2017 evaluation dataset, it achieves an error rate (ER) of 0.09±0.01, an 83% improvement over the baseline. On the TAU Spatial Sound Events 2019 evaluation dataset, it achieves an ER of 0.11±0.01, a relative improvement of 61% over the baseline, and its F1 and ER values are better than those obtained on the development dataset. Compared with state-of-the-art methods, the proposed network achieves competitive detection performance with only one-fifth of the parameters.
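
The reported error rates and relative improvements can be related with a little arithmetic. The sketch below assumes the definition of ER commonly used in sound event detection, (S + D + I) / N, where S, D, and I are substitutions, deletions, and insertions and N is the number of reference events. The abstract does not say which variant of the metric (event-based or segment-based) is used. The function names and the example counts are hypothetical. The baseline values are back-calculated from the rounded figures above, not quoted from the paper.

```python
def error_rate(substitutions, deletions, insertions, n_reference):
    """ER = (S + D + I) / N, assuming the common SED definition."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return (substitutions + deletions + insertions) / n_reference


def relative_improvement(baseline_er, system_er):
    """Fractional reduction in ER relative to the baseline (lower ER is better)."""
    return (baseline_er - system_er) / baseline_er


def implied_baseline_er(system_er, improvement):
    """Invert relative_improvement: baseline ER consistent with a reported gain."""
    return system_er / (1.0 - improvement)


# Hypothetical counts, for illustration only:
# 2 substitutions, 5 deletions, 2 insertions over 100 reference events.
example_er = error_rate(2, 5, 2, 100)

# Baseline ERs implied by the rounded numbers reported in the abstract.
tut_baseline = implied_baseline_er(0.09, 0.83)  # TUT Rare Sound Events 2017
tau_baseline = implied_baseline_er(0.11, 0.61)  # TAU Spatial Sound Events 2019

print(f"example ER: {example_er:.2f}")
print(f"implied TUT baseline ER: {tut_baseline:.2f}")
print(f"implied TAU baseline ER: {tau_baseline:.2f}")
```

Because the inputs are rounded to two digits, the implied baselines are only approximate. Note that ER can exceed 1 when insertions are numerous, so it is not a percentage.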
