Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes

The goal of Acoustic Scene Classification (ASC) is to recognise the environment in which an audio waveform has been recorded. Recently, deep neural networks have been applied to ASC and have achieved state-of-the-art performance. However, few works have investigated how to visualise and understand what a neural network has learnt from acoustic scenes. Previous work applied local pooling after each convolutional layer, thereby reducing the size of the feature maps. In this paper, we suggest that local pooling is not necessary, but that the size of the receptive field is important. We apply atrous Convolutional Neural Networks (CNNs) with global attention pooling as the classification model. The internal feature maps of the attention model can be visualised and explained. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 dataset, our proposed method achieves an accuracy of 72.7 %, significantly outperforming CNNs without dilation, which reach 60.4 %. Furthermore, our results demonstrate that the learnt feature maps contain rich information on acoustic scenes in the time-frequency domain.
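The two ideas above, enlarging the receptive field through dilation rather than local pooling and aggregating a full-resolution feature map with global attention pooling, can be illustrated with a minimal sketch. It is not the model evaluated in the paper: the kernel sizes, the dilation rates (1, 2, 4, 8), the softmax normalisation of the attention weights, and the function names `receptive_field` and `global_attention_pool` are all assumptions chosen for illustration.

```python
import math


def receptive_field(kernel_sizes, dilations):
    """Receptive field (along one axis) of stacked stride-1 convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


def global_attention_pool(class_scores, attention_logits):
    """Pool per-position class scores into one score per class.

    class_scores:     list over time-frequency positions of per-class scores.
    attention_logits: same shape; unnormalised attention values.
    For each class, the logits are softmax-normalised over all positions
    (an assumed normalisation) and used as weights in a weighted sum.
    """
    n_classes = len(class_scores[0])
    pooled = []
    for c in range(n_classes):
        logits = [row[c] for row in attention_logits]
        m = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        pooled.append(sum(w * row[c] for w, row in zip(weights, class_scores)))
    return pooled


# Four 3-wide layers: without dilation versus hypothetical rates 1, 2, 4, 8.
plain_rf = receptive_field([3, 3, 3, 3], [1, 1, 1, 1])
atrous_rf = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])

# Two positions, two classes; equal attention logits reduce to a plain mean.
scores = [[0.2, 0.8], [0.6, 0.4]]
uniform = global_attention_pool(scores, [[0.0, 0.0], [0.0, 0.0]])

# Strong attention on the second position for class 0 pulls the result towards 0.6.
focused = global_attention_pool(scores, [[0.0, 0.0], [5.0, 0.0]])
```

Under these assumptions the dilated stack covers 31 input positions per axis against 9 for the undilated one, with the same number of weights and no reduction in feature-map size. Because no local pooling is applied, the attention weights keep the resolution of the input time-frequency representation, which is what makes the internal maps straightforward to visualise.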
