ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification

Convolutional Neural Networks (CNNs) have been widely investigated for Acoustic Scene Classification (ASC). The convolution operation extracts useful semantic content from a local receptive field of the input spectrogram, bounded by a certain Manhattan distance, i.e., the kernel size. Although stacking multiple convolution layers enlarges the receptive field, the enlarged range remains limited to the vicinity of the kernel because the temporal relations between different receptive fields are not explicitly considered. In this paper, we propose a 3D CNN for ASC, named ATReSN-Net, which captures temporal relations between receptive fields at arbitrary time-frequency locations by mapping the semantic features obtained from the residual block into a semantic space. The ATReSN module has two primary components: first, a k-NN-based grouper that gathers a semantic neighborhood for each feature point in the feature maps; second, an attentive pooling-based temporal relations aggregator that generates the temporal relations embedding of each feature point and its neighborhood. Experiments show that ATReSN-Net outperforms most state-of-the-art CNN models. Our code is shared at ATReSN-Net.
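
The two components named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the relation encoding (feature difference plus time offset), and the linear scoring vector `weights` are assumptions made here for illustration only. Each feature point is treated as a vector with an associated time index; the grouper finds its k nearest neighbors in feature space regardless of where they sit in time, and the aggregator pools the neighbor relations with softmax attention.

```python
import math


def knn_group(features, k):
    """Return, for each feature point, the indices of its k nearest
    neighbors in feature (semantic) space, excluding the point itself."""
    groups = []
    for i, fi in enumerate(features):
        dists = [(math.dist(fi, fj), j)
                 for j, fj in enumerate(features) if j != i]
        dists.sort()  # ties are broken by the smaller index
        groups.append([j for _, j in dists[:k]])
    return groups


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(features, times, groups, weights):
    """Aggregate each semantic neighborhood into one embedding.

    Assumed relation encoding (hypothetical, for illustration):
    [f_j - f_i, t_j - t_i] for neighbor j of point i. Each relation is
    scored by a dot product with `weights`, the scores are normalized
    with softmax, and the relations are summed with those weights.
    """
    embeddings = []
    for i, neighbors in enumerate(groups):
        relations = [
            [fj - fi for fi, fj in zip(features[i], features[j])]
            + [times[j] - times[i]]
            for j in neighbors
        ]
        scores = [sum(w * r for w, r in zip(weights, rel))
                  for rel in relations]
        attention = softmax(scores)
        dim = len(relations[0])
        embeddings.append([
            sum(a * rel[d] for a, rel in zip(attention, relations))
            for d in range(dim)
        ])
    return embeddings


# Toy input: four 2-D feature points and the time index of each.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
times = [0, 3, 1, 4]

groups = knn_group(features, k=1)
weights = [0.0, 0.0, 0.0]  # stand-in for learned attention parameters
embeddings = attentive_pool(features, times, groups, weights)
```

In this toy input, points 0 and 1 are semantic neighbors even though they are three time steps apart, which is the kind of non-local temporal relation a fixed convolution kernel would not pair up. In a trained network the scoring weights would be learned rather than fixed.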
