An Adversarial Feature Distillation Method for Audio Classification

The audio classification task aims to discriminate between different types of audio signals. On this task, deep neural networks have outperformed traditional machine-learning methods built on shallow architectures, but their large computational and storage requirements hinder deployment on embedded devices. In this paper, we propose a distillation method that transfers knowledge from well-trained networks to a small network, compressing the model while improving audio classification accuracy. The contributions of the proposed method are twofold: a multi-level feature distillation method, and an adversarial learning strategy that improves knowledge transfer. Extensive experiments are conducted on three audio classification tasks: acoustic scene classification, general audio tagging, and speech command recognition. The results demonstrate that the small network delivers better performance while achieving a compression ratio of 76:1 in floating-point operations (FLOPs) and 3:1 in parameters.
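The abstract does not include code, but the two ingredients it names (matching the teacher's intermediate features at multiple levels, plus an adversarial critic that pushes student features toward the teacher's distribution) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only: the class and function names (`FeatureDiscriminator`, `distillation_losses`) are hypothetical, feature maps are assumed to have matching shapes at each level, and the paper's actual loss formulation and weights are not specified in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """Per-level critic: distinguishes teacher feature maps from student ones."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),  # real/fake logit
        )

    def forward(self, x):
        return self.net(x)

def distillation_losses(student_feats, teacher_feats, discriminators):
    """Multi-level feature distillation with an adversarial term.

    student_feats / teacher_feats: lists of feature maps, one per level,
    assumed shape-compatible (e.g. via 1x1 adaptation convolutions).
    Returns the feature-matching, adversarial, and discriminator losses.
    """
    feat_loss = adv_loss = disc_loss = 0.0
    for s, t, d in zip(student_feats, teacher_feats, discriminators):
        t = t.detach()  # the teacher is frozen during distillation
        feat_loss = feat_loss + F.mse_loss(s, t)  # direct feature matching
        real = torch.ones(t.size(0), 1, device=t.device)
        fake = torch.zeros(s.size(0), 1, device=s.device)
        # Discriminator update: teacher features are "real", student "fake".
        disc_loss = disc_loss + F.binary_cross_entropy_with_logits(d(t), real)
        disc_loss = disc_loss + F.binary_cross_entropy_with_logits(d(s.detach()), fake)
        # Adversarial term: the student tries to fool the discriminator.
        adv_loss = adv_loss + F.binary_cross_entropy_with_logits(d(s), real)
    return feat_loss, adv_loss, disc_loss
```

In a full training loop, `disc_loss` would drive a separate optimizer over the discriminators, while the student would minimize its classification loss plus weighted `feat_loss` and `adv_loss` terms; those weights are design choices the abstract does not fix.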
