Slow-Fast Auditory Streams for Audio Recognition

We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity, while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets, VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
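To make the division of labour between the two pathways concrete, the following is a minimal shape-level sketch, not the authors' implementation. It only tracks tensor shapes (channels, time frames, frequency bins) for one stage of each stream and for a lateral connection. The names `StreamShape`, `two_stream_shapes`, and `lateral_fuse`, the temporal stride ratio `alpha`, the channel ratio `beta`, the base width of 64 channels, and fusion by time-strided concatenation are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamShape:
    """Feature-map shape of one pathway: (channels, time frames, frequency bins)."""
    channels: int
    time: int
    freq: int


def two_stream_shapes(time_frames, freq_bins, slow_channels=64, alpha=4, beta=8):
    """Return (slow, fast) shapes for a spectrogram of the given size.

    alpha: hypothetical factor by which the Slow pathway strides the time axis.
    beta:  hypothetical factor by which the Fast pathway has fewer channels.
    """
    if time_frames % alpha != 0:
        raise ValueError("time_frames must be divisible by alpha")
    if slow_channels % beta != 0:
        raise ValueError("slow_channels must be divisible by beta")
    # Slow pathway: many channels, coarse temporal resolution.
    slow = StreamShape(slow_channels, time_frames // alpha, freq_bins)
    # Fast pathway: few channels, full (fine-grained) temporal resolution.
    fast = StreamShape(slow_channels // beta, time_frames, freq_bins)
    return slow, fast


def lateral_fuse(slow, fast, alpha=4):
    """Shape after a lateral connection from Fast to Slow.

    Assumed fusion: stride the Fast features in time by alpha so they match
    the Slow resolution, then concatenate along the channel axis.
    """
    if fast.time // alpha != slow.time or fast.freq != slow.freq:
        raise ValueError("pathway shapes are not compatible for fusion")
    return StreamShape(slow.channels + fast.channels, slow.time, slow.freq)


def subsample_time(spectrogram, alpha):
    """Keep every alpha-th time frame of a spectrogram given as a list of frames."""
    return spectrogram[::alpha]


if __name__ == "__main__":
    slow, fast = two_stream_shapes(time_frames=400, freq_bins=128)
    print(slow, fast, lateral_fuse(slow, fast))
```

Under these assumed ratios the Fast pathway sees every time frame but carries an eighth of the channels, so it stays lightweight while preserving temporal detail; the fused shape shows how its information would enter the Slow pathway at a given level.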
