PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs: https://github.com/qiuqiangkong/audioset_tagging_cnn.

[1]  Ankit Shah,et al.  DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System , 2017, DCASE.

[2]  Daniel P. W. Ellis,et al.  Detecting Alarm Sounds , 2001 .

[3]  Jae Hoon Ko,et al.  Classification of snoring sound based on a recurrent neural network , 2019, Expert Syst. Appl..

[4]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Tuomas Virtanen,et al.  TUT database for acoustic scene classification and sound event detection , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[6]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[7]  Ishwar K. Sethi,et al.  Classification of general audio data for content-based retrieval , 2001, Pattern Recognit. Lett..

[8]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[9]  Tuomas Virtanen,et al.  Transfer learning of weakly labelled audio , 2017, 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[10]  Justin Salamon,et al.  A Dataset and Taxonomy for Urban Sound Research , 2014, ACM Multimedia.

[11]  Dan Stowell,et al.  Detection and Classification of Acoustic Scenes and Events , 2015, IEEE Transactions on Multimedia.

[12]  James R. Glass,et al.  A Deep Residual Network for Large-Scale Acoustic Scene Analysis , 2019, INTERSPEECH.

[13]  Daniel P. W. Ellis,et al.  General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline , 2018, DCASE.

[14]  Jeffrey P. Woodard,et al.  Modeling and classification of natural sounds by product code hidden Markov models , 1992, IEEE Trans. Signal Process..

[15]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[16]  Hongyi Zhang,et al.  mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[17]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[20]  Yong Xu,et al.  Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems , 2019, ArXiv.

[21]  Xavier Serra,et al.  musicnn: Pre-trained convolutional neural networks for music audio tagging , 2019, ArXiv.

[22]  Tuomas Virtanen,et al.  Acoustic event detection in real life recordings , 2010, 2010 18th European Signal Processing Conference.

[23]  Edith Law,et al.  Input-agreement: a new mechanism for collecting data using human computation games , 2009, CHI.

[24]  Udit Gupta,et al.  ATTENTION-BASED CONVOLUTIONAL NEURAL NETWORK FOR AUDIO EVENT CLASSIFICATION WITH FEATURE TRANSFER LEARNING , 2018 .

[25]  George Tzanetakis,et al.  Musical genre classification of audio signals , 2002, IEEE Trans. Speech Audio Process..

[26]  Yong Xu,et al.  Audio Set Classification with Attention Model: A Probabilistic Perspective , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Benjamin Schrauwen,et al.  Transfer Learning by Supervised Pre-training for Audio-based Music Classification , 2014, ISMIR.

[28]  VirtanenTuomas,et al.  Detection and Classification of Acoustic Scenes and Events , 2018 .

[29]  Karol J. Piczak ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.

[30]  Mark D. Plumbley,et al.  Weakly Labelled AudioSet Tagging With Attention Neural Networks , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[31]  Bin Yang,et al.  Multi-level attention model for weakly supervised audio classification , 2018, DCASE.

[32]  Tuomas Virtanen,et al.  A multi-device dataset for urban acoustic scene classification , 2018, DCASE.

[33]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[34]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[35]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[36]  Yun Wang Polyphonic Sound Event Detection with Weak Labeling , 2017 .

[37]  Quoc V. Le,et al.  SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[38]  William J. Davies,et al.  Generalisation in Environmental Sound Classification: The ‘Making Sense of Sounds’ Data Set and Challenge , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Mark B. Sandler,et al.  Automatic Tagging Using Deep Convolutional Neural Networks , 2016, ISMIR.

[40]  Juhan Nam,et al.  Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms , 2017, ArXiv.

[41]  P. Karsmakers,et al.  AN MFCC-GMM APPROACH FOR EVENT DETECTION AND CLASSIFICATION , 2013 .

[42]  Heikki Huttunen,et al.  Polyphonic sound event detection using multi label deep neural networks , 2015, 2015 International Joint Conference on Neural Networks (IJCNN).

[43]  Yi-Hsuan Yang,et al.  Learning to Recognize Transient Sound Events using Attentional Supervision , 2018, IJCAI.

[44]  Wei Dai,et al.  Very deep convolutional neural networks for raw waveforms , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[45]  Mark Sandler,et al.  Transfer Learning for Music Classification and Regression Tasks , 2017, ISMIR.

[46]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  M. Kathleen Pichora-Fuller,et al.  Recognition of emotional speech for younger and older talkers: Behavioural findings from the toronto emotional speech set , 2011 .

[48]  Florian Metze,et al.  A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49]  Xavier Serra,et al.  End-to-end Learning for Music Audio Tagging at Scale , 2017, ISMIR.

[50]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[52]  Colin Raffel,et al.  librosa: Audio and Music Signal Analysis in Python , 2015, SciPy.

[53]  Yonghong Yan,et al.  Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling , 2019, ArXiv.

[54]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[55]  Mathieu Lagrange,et al.  Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[56]  Zhang Yi,et al.  Spectrogram based multi-task audio classification , 2017, Multimedia Tools and Applications.

[57]  Hemant A. Patil,et al.  Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification , 2017, INTERSPEECH.

[58]  Buket D. Barkana,et al.  NON-SPEECH ENVIRONMENTAL SOUND CLASSIFICATION USING SVMS WITH A NEW SET OF FEATURES , 2012 .

[59]  Shenglan Liu,et al.  Bottom-up broadcast neural network for music genre classification , 2019, Multimedia Tools and Applications.