Receptive Field Regularization Techniques for Audio Classification and Tagging With Deep Convolutional Neural Networks

In this paper, we study the performance of variants of well-known Convolutional Neural Network (CNN) architectures on different audio tasks. We show that tuning the Receptive Field (RF) of CNNs is crucial to their generalization: an insufficient RF limits a CNN's ability to fit the training data, whereas CNNs with an excessive RF tend to overfit the training data and fail to generalize to unseen test data. As state-of-the-art CNN architectures in computer vision and other domains grow deeper, their RF size increases, and their performance consequently degrades on several audio classification and tagging tasks. We analyze well-known CNN architectures and how their building blocks affect the receptive field. We propose several approaches to control the RF of CNNs and systematically test the resulting architectures on different audio classification and tagging tasks and datasets. The experiments show that regularizing the RF of CNNs with our proposed approaches can drastically improve generalization, outperforming more complex architectures and models pre-trained on larger datasets. The proposed CNNs achieve state-of-the-art results on multiple tasks, from acoustic scene classification to emotion and theme detection in music to instrument recognition, as demonstrated by top ranks in several pertinent challenges (DCASE, MediaEval).
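To make the notion of RF size concrete, the following minimal Python sketch computes the theoretical receptive field of a sequential stack of convolution and pooling layers via the standard recursion over kernel size, stride, and dilation. The layer configurations here are illustrative assumptions, not the exact architectures evaluated in the paper; the second stack simply shows how replacing some 3x3 kernels with 1x1 kernels is one way to shrink the RF.

```python
# Minimal sketch: theoretical receptive field of a sequential CNN.
# Layer specs (kernel, stride, dilation) are illustrative assumptions,
# not the architectures from the paper.

def receptive_field(layers):
    """Return the theoretical RF of a stack of (kernel, stride, dilation) layers.

    Standard recursion: with jump j (input-pixel distance between adjacent
    output positions) starting at 1 and RF r starting at 1, each layer
    updates r += (k_eff - 1) * j, then j *= s.
    """
    r, j = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1  # effective kernel size under dilation
        r += (k_eff - 1) * j     # RF grows by (k_eff - 1) input steps
        j *= s                   # stride compounds the jump
    return r

if __name__ == "__main__":
    # A VGG-like block: three 3x3 convs, each followed by 2x2 max pooling.
    vgg_like = [(3, 1, 1), (2, 2, 1)] * 3
    # Swapping the later 3x3 convs for 1x1 shrinks the RF.
    rf_shrunk = [(3, 1, 1), (2, 2, 1),
                 (1, 1, 1), (2, 2, 1),
                 (1, 1, 1), (2, 2, 1)]
    print("VGG-like RF:", receptive_field(vgg_like))   # 22
    print("RF-shrunk  :", receptive_field(rf_shrunk))  # 10
```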
