Multi-Stream Network with Temporal Attention for Environmental Sound Classification

Environmental sound classification systems often fail to perform robustly across different classification tasks and across audio signals with varying temporal structures. We introduce a multi-stream convolutional neural network with temporal attention that addresses these problems. The network takes three input streams, consisting of raw audio and spectral features, and applies a temporal attention function computed from energy changes over time. Training and classification use decision fusion and data augmentation techniques that incorporate uncertainty. We evaluate this network on three commonly used data sets for environmental sound and audio scene classification and achieve new state-of-the-art performance without any changes to the network architecture or front-end preprocessing, thus demonstrating better generalizability.
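The abstract describes a temporal attention function computed from energy changes over time. As a minimal illustrative sketch (not the paper's actual implementation; the function name, feature shapes, and the use of a softmax over energy deltas are all assumptions), frame-level features could be pooled into a clip-level embedding like this:

```python
import numpy as np

def temporal_attention_pool(features, frame_energy):
    """Pool frame-level features with attention weights derived from
    energy changes over time (hypothetical sketch, not the paper's code)."""
    # Absolute energy change between consecutive frames; prepend keeps length T.
    delta = np.abs(np.diff(frame_energy, prepend=frame_energy[0]))
    # Softmax over the time axis turns energy changes into attention weights.
    w = np.exp(delta - delta.max())
    w /= w.sum()
    # Attention-weighted sum over time yields a single clip-level embedding.
    return (features * w[:, None]).sum(axis=0)

T, D = 100, 64                          # number of frames, feature dimension
feats = np.random.randn(T, D)           # dummy frame-level features
energy = np.abs(np.random.randn(T))     # dummy per-frame energy
embedding = temporal_attention_pool(feats, energy)
print(embedding.shape)                  # (64,)
```

Frames where the energy changes sharply (e.g. transient events) receive higher weight, so the pooled embedding emphasizes acoustically salient regions rather than averaging uniformly over time.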
