Pruning vs XNOR-Net: A Comprehensive Study of Deep Learning for Audio Classification on Edge-devices

Deep learning has celebrated resounding successes in many application areas of relevance to the Internet of Things (IoT), such as computer vision and machine listening. These technologies must ultimately be brought directly to the edge to fully harness the power of deep learning for the IoT. The obvious challenge is that deep learning techniques can only be implemented on strictly resource-constrained edge devices if the models are radically downsized. This task relies on model compression techniques such as network pruning, quantization, and, more recently, XNOR-Net. This study examines the suitability of these techniques for audio classification on microcontrollers. We present an application of XNOR-Net to end-to-end raw audio classification and a comprehensive empirical study comparing this approach with pruning-and-quantization methods. We show that raw audio classification with XNOR-Net yields performance comparable to regular full-precision networks for small numbers of classes, while reducing memory requirements 32-fold and computation requirements 58-fold. However, as the number of classes increases substantially, performance degrades, and pruning-and-quantization-based compression takes over as the preferred technique: it satisfies the same space constraints but requires approximately 8x more computation. We show that these insights are consistent between raw audio classification and image classification on standard benchmark datasets. To the best of our knowledge, this is the first study to apply XNOR-Net to end-to-end audio classification and to evaluate it against alternative techniques. All code is publicly available on GitHub.
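
To make the 32-fold memory figure concrete, the sketch below illustrates the XNOR-Net-style weight binarization idea: a real-valued filter is approximated as a per-filter scaling factor times a sign tensor, so each weight needs only 1 bit instead of 32, and the binary dot product can be realized with XNOR and popcount operations on packed bits. This is a minimal, illustrative NumPy example and not the paper's actual implementation; the toy filter, input, and sizes are assumptions for demonstration.

```python
# Minimal sketch of XNOR-Net-style weight binarization (illustrative only;
# the study's actual models and training code live in its GitHub repository).
import numpy as np

def binarize_filter(w):
    """Approximate a real-valued filter w as alpha * sign(w),
    where alpha is the mean absolute value of w (per-filter scaling)."""
    alpha = np.abs(w).mean()
    return alpha, np.sign(w)

# A toy 1-D convolution filter, e.g. over a raw-audio window (values arbitrary).
w = np.random.randn(9).astype(np.float32)
alpha, b = binarize_filter(w)

# Storage: 32 bits per full-precision weight vs. 1 bit per binarized weight
# plus one 32-bit scaling factor -- the source of the ~32x memory reduction.
full_precision_bits = w.size * 32
binary_bits = w.size * 1 + 32
print(full_precision_bits, binary_bits)

# With binary activations, the dot product reduces to XNOR + popcount on
# packed bits; here it is emulated with +/-1 arithmetic for clarity.
x = np.sign(np.random.randn(9)).astype(np.float32)
approx = alpha * np.dot(b, x)   # binarized filter response
exact = np.dot(w, x)            # full-precision response
```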
