Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models

Continued improvements in machine learning techniques offer exciting new opportunities through the use of larger models and larger training datasets. However, there is a growing need to offer these capabilities on board low-powered devices such as smartphones, wearables, and other embedded environments where only limited memory is available. Towards this, we consider methods to reduce the size of Conformer-based speech recognition models, which typically require more than 100M parameters, down to just $5$M parameters while minimizing the impact on model quality. Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors. We propose model weight reuse at different levels within our model architecture: (i) repeating full conformer block layers, (ii) sharing specific conformer modules across layers, (iii) sharing sub-components per conformer module, and (iv) sharing decomposed sub-component weights after low-rank decomposition. By sharing weights at different levels of our model, we can retain the full model in memory while increasing the number of virtual transformations applied to the input. Through a series of ablation studies and evaluations, we find that with weight sharing and a low-rank architecture, we can achieve WERs of 2.84 and 2.94 on the LibriSpeech dev-clean and test-clean sets, respectively, with a $5$M-parameter model.
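The parameter-saving idea behind points (ii)–(iv) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, the single shared linear transform standing in for a conformer module, and all variable names are illustrative assumptions. It shows how one low-rank factor pair ($W \approx UV$) reused across several virtual layers replaces per-layer full-rank matrices, shrinking the stored parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): model width, decomposition
# rank, and number of virtual layers that reuse the same weights.
d_model, rank, num_layers = 256, 32, 4

# One shared low-rank factor pair; W = U @ V is never materialized.
U = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)
V = rng.standard_normal((rank, d_model)) / np.sqrt(rank)

def shared_low_rank_layer(x):
    # Rank-r linear transform applied as two thin matmuls.
    return x @ U @ V

# Reusing the same factors per layer yields num_layers "virtual"
# transformations from a single stored weight set.
x = rng.standard_normal((1, d_model))
for _ in range(num_layers):
    x = shared_low_rank_layer(x)

shared_params = U.size + V.size               # 2 * d_model * rank
full_params = num_layers * d_model * d_model  # unshared, full-rank
print(shared_params, full_params)
```

With these toy sizes, the shared low-rank factors store 16,384 values versus 262,144 for four unshared full-rank layers, a 16x reduction; the same accounting, at the real model's dimensions, is what enables the $5$M-parameter budget.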
