Sound Event Detection Using Multiple Optimized Kernels

Sound event detection (SED) is widely used in real-world applications. SED approaches based on convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance. However, the convolution is typically performed with a kernel of a single fixed size, which degrades detection accuracy, especially when the acoustic features of different event classes vary widely. To address this, this article proposes an SED technique that uses a CRNN framework with multiple convolutional kernels of different sizes. The top-performing kernels are selected from a kernel pool based on unsupervised clustering errors and the accuracies of temporarily trained models. The selected kernels are then used in multiple convolution layers to handle the variations in acoustic features. Experimental results on two subsets of AudioSet, the DCASE Challenge 2017 Task 4 and DCASE Challenge 2018 Task 4 datasets, demonstrate the performance of the proposed approach in comparison with state-of-the-art systems.
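The two ideas in the abstract, picking kernels from a pool and then convolving the same input with several kernel sizes in parallel, can be sketched roughly as follows. This is only an illustration under stated assumptions: the scoring rule (accuracy minus clustering error), the function names, and the plain averaging filter standing in for learned weights are hypothetical and are not taken from the article.

```python
from typing import Dict, List, Tuple

# (height over frequency bins, width over time frames)
Kernel = Tuple[int, int]


def select_kernels(cluster_error: Dict[Kernel, float],
                   accuracy: Dict[Kernel, float],
                   top_k: int) -> List[Kernel]:
    """Keep the top_k kernels from the pool.

    Assumption: the two criteria are combined as accuracy minus
    clustering error; the article may combine them differently.
    """
    score = {k: accuracy[k] - cluster_error[k] for k in cluster_error}
    return sorted(score, key=lambda k: score[k], reverse=True)[:top_k]


def conv2d_same(x: List[List[float]], kernel: Kernel) -> List[List[float]]:
    """Zero-padded 2D convolution whose output has the input's shape.

    Assumption: a fixed averaging filter replaces learned weights.
    """
    kh, kw = kernel
    rows, cols = len(x), len(x[0])
    top, left = (kh - 1) // 2, (kw - 1) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - top, c + j - left
                    # Positions outside the input count as zero padding.
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += x[rr][cc]
            out[r][c] = total / (kh * kw)
    return out


def multi_kernel_features(x: List[List[float]],
                          kernels: List[Kernel]) -> List[List[List[float]]]:
    """Apply every selected kernel to the same input, one map per kernel."""
    return [conv2d_same(x, k) for k in kernels]
```

Because every branch uses zero padding to preserve the input shape, the resulting feature maps can be stacked along a channel axis before being passed to later layers, which is the usual reason for choosing "same" padding in multi-kernel designs.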
