AclNet: efficient end-to-end audio classification CNN

We propose AclNet, an efficient end-to-end convolutional neural network architecture for audio classification. When trained with our data augmentation and regularization, it achieves state-of-the-art accuracy of 85.65% on the ESC-50 corpus. The network can be configured to drastically reduce memory and compute requirements, and we present a tradeoff analysis of accuracy versus complexity. The analysis shows high accuracy at significantly lower computational complexity than existing solutions: for example, a configuration with only 155k parameters and 49.3 million multiply-adds per second achieves 81.75% accuracy, exceeding the human accuracy of 81.3%. This efficiency can enable always-on inference on energy-efficient platforms.
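A common way for efficient CNNs to reach such low multiply-add counts is to replace standard convolutions with depthwise-separable ones (the MobileNets approach). The sketch below is illustrative arithmetic only, with hypothetical layer shapes not taken from the paper; it shows the order-of-magnitude reduction in multiply-adds (MACs) for a single 1-D convolutional layer.

```python
# Illustrative MAC-count arithmetic for one 1-D conv layer.
# Layer shapes (c_in, c_out, k, out_len) are hypothetical examples,
# not the actual AclNet configuration.

def standard_conv_macs(c_in, c_out, k, out_len):
    """MACs for a standard 1-D convolution producing out_len frames."""
    return c_in * c_out * k * out_len

def depthwise_separable_macs(c_in, c_out, k, out_len):
    """MACs for a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = c_in * k * out_len          # one k-tap filter per input channel
    pointwise = c_in * c_out * out_len      # 1x1 conv mixes channels
    return depthwise + pointwise

c_in, c_out, k, out_len = 64, 128, 9, 1000
std = standard_conv_macs(c_in, c_out, k, out_len)    # 73,728,000
sep = depthwise_separable_macs(c_in, c_out, k, out_len)  # 8,768,000
print(f"standard: {std:,}  separable: {sep:,}  reduction: {std / sep:.1f}x")
```

For this example the separable layer needs roughly 8x fewer multiply-adds, which is how a network can stay under ~50 million multiply-adds per second while retaining accuracy.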