Multi-Scale Recurrent Neural Network for Sound Event Detection

Sound event detection (SED) in real-life recordings is an interesting but challenging task because sound events are polyphonic and exhibit long-term dependencies. Recently, multi-label recurrent neural networks (RNNs) have shown promise. However, even when equipped with long short-term memory (LSTM) or gated recurrent unit (GRU) cells, RNNs remain limited in their ability to model long-term dependencies. In this paper, we propose a multi-scale RNN to address this issue. By integrating information from different time resolutions, the model can better capture both the fine-grained and the long-term dependencies of sound events. We experiment on the development sets of Task 3 of DCASE2016 and DCASE2017. Compared with our previously proposed single-scale RNN, which placed third among the 13 teams in Task 3 of DCASE2017, the proposed multi-scale model achieves statistically significantly better performance on the development datasets of both DCASE2016 and DCASE2017.
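To make the idea of "different time resolutions" concrete, the following is a minimal sketch of multi-resolution feature pooling over a sequence of audio frames. It is not the model proposed in the paper: the abstract does not specify how the scales are built or fused, and the recurrent layers are omitted entirely. The function names `downsample` and `multiscale_features`, the pooling factors (1, 2, 4), and the use of average pooling with concatenation are all assumptions made for illustration.

```python
import numpy as np

def downsample(frames, factor):
    """Average every `factor` consecutive frames to get a coarser time resolution."""
    n_frames, n_feats = frames.shape
    # Pad the end by repeating the last frame so the length divides evenly.
    pad = (-n_frames) % factor
    padded = np.pad(frames, ((0, pad), (0, 0)), mode="edge")
    return padded.reshape(-1, factor, n_feats).mean(axis=1)

def multiscale_features(frames, factors=(1, 2, 4)):
    """Concatenate, for each frame, its features at several time resolutions.

    Hypothetical fusion scheme: each coarse sequence is repeated back to the
    original frame rate and the scales are concatenated along the feature axis.
    """
    n_frames = frames.shape[0]
    views = []
    for factor in factors:
        coarse = downsample(frames, factor)
        # Repeat each coarse frame so every scale aligns with the original frames.
        views.append(np.repeat(coarse, factor, axis=0)[:n_frames])
    return np.concatenate(views, axis=1)

# Toy input: 5 frames with 2 features each.
frames = np.arange(10, dtype=float).reshape(5, 2)
fused = multiscale_features(frames)
```

In a setup like this, the fine scale preserves frame-level detail while the coarser scales summarize longer spans of context; a recurrent model operating on such views would see fewer time steps at the coarse scales, which is one plausible way to shorten the dependency paths it has to learn.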