论文信息 - Multi-scale semantic feature fusion and data augmentation for acoustic scene classification

Multi-scale semantic feature fusion and data augmentation for acoustic scene classification

Abstract This paper investigates a multi-scale semantic feature fusion and data augmentation approach for deep convolutional neural network (CNN) based acoustic scene classification. To ensemble the multi-scale semantic information of CNN and improve the performance of acoustic scene classification, a multi-scale feature fusion framework, which consists of a simplified Xception backbone and a semantic feature fusion strategy, is presented. A novel label smoothing mixup data augmentation method, which is a generalization of mixup and label smoothing, is proposed to alleviate the over-confident problem of network training. A spatial-mixup technique is presented to generate meaningful mixup virtual data for acoustic scene classification. Extensive experiments on synthetic data and real acoustic scene classification dataset demonstrate that both multi-scale semantic feature fusion and label smoothing spatial-mixup data augmentation are effective for improving the acoustic scene classification performance of a deep neural network.

[1] Tuomas Virtanen,et al. TUT database for acoustic scene classification and sound event detection , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[2] Sridhar Krishnan,et al. Combining Temporal Features by Local Binary Pattern for Acoustic Scene Classification , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[4] Renate Sitte,et al. Comparison of techniques for environmental sound recognition , 2003, Pattern Recognit. Lett..

[5] Hongyi Zhang,et al. mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[6] Gaël Richard,et al. Feature Learning With Matrix Factorization Applied to Acoustic Scene Classification , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7] Jian Sun,et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2015, IEEE Trans. Pattern Anal. Mach. Intell..

[8] Mark D. Plumbley,et al. Acoustic Scene Classification: Classifying environments from the sounds they produce , 2014, IEEE Signal Processing Magazine.

[9] François Chollet,et al. Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Ronan Collobert,et al. Learning to Refine Object Segments , 2016, ECCV.

[11] Roberto Togneri,et al. Spectrotemporal Analysis Using Local Binary Pattern Variants for Acoustic Scene Classification , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12] Richard F. Lyon,et al. Machine Hearing: An Emerging Field , 2010 .

[13] Alain Rakotomamonjy,et al. Supervised Representation Learning for Audio Scene Classification , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[14] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15] Sacha Krstulovic,et al. Automatic Environmental Sound Recognition: Performance Versus Computational Cost , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16] Alain Rakotomamonjy,et al. Histogram of gradients of Time-Frequency Representations for Audio scene detection , 2015, ArXiv.

[17] Birger Kollmeier,et al. Classifier Architectures for Acoustic Scenes and Events: Implications for DNNs, TDNNs, and Perceptual Features from DCASE 2016 , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[18] Roberto Togneri,et al. Enhanced LBP texture features from time frequency representations for acoustic scene classification , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Shih-Fu Chang,et al. Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification , 2017, IEEE Transactions on Multimedia.

[20] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21] Justin Salamon,et al. Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification , 2016, IEEE Signal Processing Letters.

[22] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Heikki Huttunen,et al. ACOUSTIC SCENE CLASSIFICATION: A COMPETITION REVIEW , 2018, 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP).

[24] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Luc Van Gool,et al. AENet: Learning Deep Audio Features for Video Analysis , 2017, IEEE Transactions on Multimedia.

[26] Yoshua Bengio,et al. Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[27] Dan Stowell,et al. Detection and Classification of Acoustic Scenes and Events , 2015, IEEE Transactions on Multimedia.

[28] Dan Stowell,et al. A database and challenge for acoustic scene classification and event detection , 2013, 21st European Signal Processing Conference (EUSIPCO 2013).

[29] Jason Weston,et al. Vicinal Risk Minimization , 2000, NIPS.

[30] Joan Bruna,et al. Intriguing properties of neural networks , 2013, ICLR.

[31] Kun Qian,et al. Deep Scalogram Representations for Acoustic Scene Classification , 2018, IEEE/CAA Journal of Automatica Sinica.

[32] Xiaodong Cui,et al. Data Augmentation for Deep Neural Network Acoustic Modeling , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[33] Richard F. Lyon,et al. Machine Hearing: An Emerging Field [Exploratory DSP] , 2010, IEEE Signal Processing Magazine.

[34] Mathieu Lagrange,et al. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[35] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).