Tiny Transformers for Environmental Sound Classification at the Edge

With the growth of the Internet of Things and the rise of Big Data, data processing and machine learning applications are moving to cheap, low size, weight, and power (SWaP) devices at the edge, often in the form of mobile phones, embedded systems, or microcontrollers. The field of Cyber-Physical Measurement and Signature Intelligence (MASINT) uses these devices to analyze and exploit data in ways not otherwise possible, yielding higher data quality, stronger security, and lower bandwidth requirements. However, methods to train and deploy models at the edge are limited, and models with sufficient accuracy are often too large for the target device. There is therefore a clear need for techniques that produce efficient AI/ML models at the edge. This work presents training techniques for audio models in the field of environmental sound classification at the edge. Specifically, we design and train Transformers to classify office sounds in audio clips. Results show that a BERT-based Transformer, trained on Mel spectrograms, can outperform a CNN while using 99.85% fewer parameters. To achieve this result, we first tested several audio feature extraction techniques designed for Transformers, along with various augmentations, using ESC-50 as a benchmark. Our final model outperforms the state-of-the-art MFCC-based CNN on the office sounds dataset using just over 6,000 parameters, small enough to run on a microcontroller.
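As a rough illustration of the pipeline the abstract describes, the sketch below feeds Mel spectrogram frames to a small BERT-style Transformer encoder in PyTorch. The feature sizes, layer counts, and the TinyAudioTransformer class are illustrative assumptions for the sake of a self-contained example; they are not the paper's exact architecture or its ~6,000-parameter configuration.

```python
# Minimal sketch: Mel spectrogram frames -> tiny BERT-style Transformer.
# All sizes (n_mels, d_model, heads, layers) are illustrative assumptions,
# chosen only to land in the few-thousand-parameter regime the paper targets.
import torch
import torch.nn as nn
import torchaudio

class TinyAudioTransformer(nn.Module):
    def __init__(self, n_mels=32, d_model=16, n_heads=2, n_layers=1, n_classes=10):
        super().__init__()
        # Project each Mel frame (one spectrogram column) to the model dimension.
        self.proj = nn.Linear(n_mels, d_model)
        # Learnable [CLS] token, as in BERT, pooled for classification.
        # (Positional encodings are omitted here for brevity.)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=2 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, mel):                          # mel: (batch, n_mels, time)
        x = self.proj(mel.transpose(1, 2))           # (batch, time, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(x[:, 0])                    # classify from the [CLS] token

# Standard Mel spectrogram front end; parameter values are assumptions.
mel_fe = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, hop_length=256, n_mels=32)

model = TinyAudioTransformer()
print(sum(p.numel() for p in model.parameters()))    # a few thousand parameters
```

Even at this toy scale, the parameter count stays in the low thousands, which is what makes a Transformer of this shape plausible for microcontroller-class deployment once quantized or otherwise compressed.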
