CAA-Net: Conditional Atrous CNNs With Attention for Explainable Device-Robust Acoustic Scene Classification

Acoustic Scene Classification (ASC) aims to classify the environment in which an audio signal is recorded. Recently, Convolutional Neural Networks (CNNs) have been successfully applied to ASC. However, audio signals recorded with different devices follow different data distributions. Little research has addressed training robust neural networks on acoustic scene datasets recorded with multiple devices, or explaining the operation of the internal layers of such networks. In this article, we focus on training and explaining device-robust CNNs on multi-device acoustic scene data. We propose conditional atrous CNNs with attention for multi-device ASC. Our proposed system contains an ASC branch and a device classification branch, both modelled by CNNs. We visualise and analyse the intermediate layers of the atrous CNNs. A time-frequency attention mechanism is employed to analyse the contribution of each time-frequency bin of the feature maps in the CNNs. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 ASC dataset, recorded with three devices, our proposed model performs significantly better than CNNs trained on single-device data.
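
The two building blocks named above, atrous (dilated) convolution and attention over time-frequency bins, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `atrous_conv1d` and `attention_pool` are hypothetical, the convolution is shown in 1-D for brevity, and normalising the attention scores with a softmax over all bins is an assumption, since the abstract does not specify the normalisation.

```python
import math


def atrous_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D atrous (dilated) convolution, in cross-correlation form.

    Kernel taps are spaced `dilation` samples apart, which widens the
    receptive field without adding parameters.
    """
    span = (len(kernel) - 1) * dilation  # receptive field minus one sample
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def attention_pool(feature_map, score_map):
    """Pool a 2-D (time x frequency) feature map with attention weights.

    Assumption: the weights are a softmax of `score_map` taken over all
    time-frequency bins, so they are positive and sum to 1.
    Returns the pooled scalar and the weight map (useful for visualisation).
    """
    peak = max(s for row in score_map for s in row)  # for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]
    pooled = sum(
        w * f
        for w_row, f_row in zip(weights, feature_map)
        for w, f in zip(w_row, f_row)
    )
    return pooled, weights


if __name__ == "__main__":
    # A two-tap kernel with dilation 2 adds samples that are two steps apart.
    print(atrous_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))
    # Uniform scores give uniform weights, so pooling reduces to the mean.
    pooled, weights = attention_pool([[1.0, 2.0], [3.0, 4.0]],
                                     [[0.0, 0.0], [0.0, 0.0]])
    print(pooled, weights)
```

In a sketch like this, the returned weight map is what one would inspect to see which time-frequency bins contribute most to the pooled output; a bin with a much larger score than the rest receives nearly all of the weight.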
