What Affects the Performance of Convolutional Neural Networks for Audio Event Classification

Convolutional neural networks (CNNs) have played an important role in audio event classification (AEC). Both 1D-CNN and 2D-CNN methods have been applied to improve AEC classification accuracy, and many factors affect the performance of CNN-based models. In this paper, we study several of these factors, including sampling rate, signal segmentation method, window size, number of mel bins, and filter size. Among them, the segmentation method of the event signal is particularly important: because audio events usually last only a short time, segmentation may lead to overfitting. We propose a signal segmentation method called Fill-length Processing to address this problem. Based on our study of these factors, we design convolutional neural networks for audio event classification, called FPNet. On the environmental sound dataset ESC-50, FPNet-1D and FPNet-2D achieve classification accuracies of 73.90% and 85.10%, respectively, a significant improvement over previous methods.
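
To make the interaction between these factors concrete, the sketch below computes the shape of the time-frequency input that a 2D-CNN would receive for a given sampling rate, window size, and number of mel bins. It is an illustration only, not the configuration used for FPNet: the function name, the hop size, and the specific values (1024-sample window, 512-sample hop, 64 mel bins) are assumptions, and framing is assumed to be unpadded. The 5-second clip length matches ESC-50.

```python
def mel_spectrogram_shape(duration_s, sample_rate, window_size, hop_size, n_mels):
    """Return (n_mels, n_frames) for unpadded framing of a signal.

    All parameter values passed below are illustrative assumptions,
    not settings reported by the paper.
    """
    n_samples = int(duration_s * sample_rate)
    if n_samples < window_size:
        # The signal is too short to fill even one analysis window.
        return (n_mels, 0)
    # One frame for the first window, plus one per full hop that still fits.
    n_frames = 1 + (n_samples - window_size) // hop_size
    return (n_mels, n_frames)


if __name__ == "__main__":
    # A 5-second clip (the ESC-50 clip length) at two sampling rates.
    for sr in (44100, 22050):
        shape = mel_spectrogram_shape(5.0, sr, window_size=1024, hop_size=512, n_mels=64)
        print(sr, shape)
```

Under these assumptions, halving the sampling rate roughly halves the number of time frames while leaving the number of mel bins unchanged, which is one reason these factors cannot be tuned independently of the network's filter size.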
