Multi-scale semantic feature fusion and data augmentation for acoustic scene classification

Abstract This paper investigates a multi-scale semantic feature fusion and data augmentation approach for deep convolutional neural network (CNN) based acoustic scene classification. To ensemble the multi-scale semantic information of CNN and improve the performance of acoustic scene classification, a multi-scale feature fusion framework, which consists of a simplified Xception backbone and a semantic feature fusion strategy, is presented. A novel label smoothing mixup data augmentation method, which is a generalization of mixup and label smoothing, is proposed to alleviate the over-confident problem of network training. A spatial-mixup technique is presented to generate meaningful mixup virtual data for acoustic scene classification. Extensive experiments on synthetic data and real acoustic scene classification dataset demonstrate that both multi-scale semantic feature fusion and label smoothing spatial-mixup data augmentation are effective for improving the acoustic scene classification performance of a deep neural network.

[1]  Tuomas Virtanen,et al.  TUT database for acoustic scene classification and sound event detection , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[2]  Sridhar Krishnan,et al.  Combining Temporal Features by Local Binary Pattern for Acoustic Scene Classification , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[4]  Renate Sitte,et al.  Comparison of techniques for environmental sound recognition , 2003, Pattern Recognit. Lett..

[5]  Hongyi Zhang,et al.  mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[6]  Gaël Richard,et al.  Feature Learning With Matrix Factorization Applied to Acoustic Scene Classification , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2015, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Mark D. Plumbley,et al.  Acoustic Scene Classification: Classifying environments from the sounds they produce , 2014, IEEE Signal Processing Magazine.

[9]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Ronan Collobert,et al.  Learning to Refine Object Segments , 2016, ECCV.

[11]  Roberto Togneri,et al.  Spectrotemporal Analysis Using Local Binary Pattern Variants for Acoustic Scene Classification , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12]  Richard F. Lyon,et al.  Machine Hearing: An Emerging Field , 2010 .

[13]  Alain Rakotomamonjy,et al.  Supervised Representation Learning for Audio Scene Classification , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[14]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15]  Sacha Krstulovic,et al.  Automatic Environmental Sound Recognition: Performance Versus Computational Cost , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16]  Alain Rakotomamonjy,et al.  Histogram of gradients of Time-Frequency Representations for Audio scene detection , 2015, ArXiv.

[17]  Birger Kollmeier,et al.  Classifier Architectures for Acoustic Scenes and Events: Implications for DNNs, TDNNs, and Perceptual Features from DCASE 2016 , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[18]  Roberto Togneri,et al.  Enhanced LBP texture features from time frequency representations for acoustic scene classification , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Shih-Fu Chang,et al.  Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification , 2017, IEEE Transactions on Multimedia.

[20]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Justin Salamon,et al.  Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification , 2016, IEEE Signal Processing Letters.

[22]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Heikki Huttunen,et al.  ACOUSTIC SCENE CLASSIFICATION: A COMPETITION REVIEW , 2018, 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP).

[24]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Luc Van Gool,et al.  AENet: Learning Deep Audio Features for Video Analysis , 2017, IEEE Transactions on Multimedia.

[26]  Yoshua Bengio,et al.  Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[27]  Dan Stowell,et al.  Detection and Classification of Acoustic Scenes and Events , 2015, IEEE Transactions on Multimedia.

[28]  Dan Stowell,et al.  A database and challenge for acoustic scene classification and event detection , 2013, 21st European Signal Processing Conference (EUSIPCO 2013).

[29]  Jason Weston,et al.  Vicinal Risk Minimization , 2000, NIPS.

[30]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.

[31]  Kun Qian,et al.  Deep Scalogram Representations for Acoustic Scene Classification , 2018, IEEE/CAA Journal of Automatica Sinica.

[32]  Xiaodong Cui,et al.  Data Augmentation for Deep Neural Network Acoustic Modeling , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[33]  Richard F. Lyon,et al.  Machine Hearing: An Emerging Field [Exploratory DSP] , 2010, IEEE Signal Processing Magazine.

[34]  Mathieu Lagrange,et al.  Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[35]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).