E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks

Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking. Recent machine learning methods, such as convolutional neural networks (CNNs), have been shown to automatically recognize such sound activities, a task known as audio tagging. One such method, pre-trained audio neural networks (PANNs), provides a neural network pre-trained on over 500 sound classes from the publicly available AudioSet dataset, which can be used as a baseline or starting point for other audio tasks. However, the existing PANNs model has high computational complexity and a large storage requirement. This could limit the potential for deploying PANNs on resource-constrained devices, such as on-the-edge sound sensors, and could lead to high energy consumption if many such devices were deployed. In this paper, we reduce the computational complexity and memory requirement of PANNs by applying a pruning approach that eliminates redundant parameters from the model. The resulting Efficient PANNs (E-PANNs) model requires 36% less computation and 70% less memory, while also slightly improving the sound recognition (audio tagging) performance. The code for the E-PANNs model has been released under an open source license.
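
To make the pruning idea concrete, the minimal PyTorch sketch below removes convolutional filters from a single layer and rebuilds a smaller layer from the surviving weights. It uses a simple L1-norm magnitude criterion purely for illustration; this is not necessarily the exact criterion used for E-PANNs, and the function name, keep ratio, and layer sizes are illustrative assumptions.

```python
# Minimal sketch of filter pruning on one convolutional layer, assuming
# PyTorch. The L1-norm ranking, function name, and keep ratio are
# illustrative assumptions, not the paper's exact pruning method.
import torch
import torch.nn as nn


def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Return a smaller Conv2d that keeps the highest-L1-norm filters."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    # Score each output filter by the L1 norm of its weights;
    # weight has shape (out_channels, in_channels, kH, kW).
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep_idx = torch.argsort(scores, descending=True)[:n_keep]

    pruned = nn.Conv2d(
        conv.in_channels,
        n_keep,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep_idx])
    return pruned


# Example: halve the filters of a 64-filter layer.
layer = nn.Conv2d(1, 64, kernel_size=3, padding=1)
print(prune_conv_filters(layer))  # Conv2d(1, 32, kernel_size=(3, 3), ...)
```

Note that pruning the output filters of one layer also shrinks the input expected by the next layer, so the in_channels of downstream layers must be reduced to match, and the pruned network is typically fine-tuned afterwards to recover (or, as reported for E-PANNs, slightly improve) the audio tagging accuracy.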
