RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

With the rapid increase in the size of neural networks, model compression has become an important area of research. Quantization is an effective technique for reducing the model size, memory access, and compute load of large models. Despite recent advances in quantization-aware training (QAT) techniques, most papers present evaluations focused on computer vision tasks, which have different training dynamics than sequence tasks. In this paper, we first benchmark the impact of popular techniques such as the straight-through estimator, pseudo-quantization noise, learnable scale parameters, and clipping on 4-bit seq2seq models across a suite of speech recognition datasets ranging from 1,000 hours to 1 million hours, as well as one machine translation dataset to illustrate applicability outside of speech. Through these experiments, we find that noise-based QAT suffers when there is an insufficient regularization signal flowing back to the quantization scale. We propose low-complexity changes to the QAT process that improve model accuracy, outperforming popular learnable-scale and clipping methods. The improved accuracy also opens up the possibility of exploiting other benefits of noise-based QAT: 1) training a single model that performs well in mixed-precision mode, and 2) improved generalization on long-form speech recognition.
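To make the two QAT families named above concrete, here is a minimal PyTorch sketch contrasting straight-through-estimator (STE) fake quantization with pseudo-quantization-noise training, assuming symmetric per-tensor 4-bit quantization with a fixed scale. This is an illustrative sketch only; it does not reproduce the paper's scale handling or the proposed norm-decay method.

```python
import torch

def fake_quant_ste(w: torch.Tensor, scale: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """STE fake quantization: round to the integer grid in the forward pass,
    pass gradients straight through to w in the backward pass."""
    qmax = 2 ** (bits - 1) - 1                          # e.g. 7 for 4-bit symmetric
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses q, backward treats the op as identity.
    return w + (q - w).detach()

def fake_quant_noise(w: torch.Tensor, scale: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Pseudo-quantization noise: replace rounding with additive uniform noise of
    comparable magnitude, keeping the operation differentiable end to end."""
    qmax = 2 ** (bits - 1) - 1
    noise = (torch.rand_like(w) - 0.5) * scale          # U(-scale/2, scale/2)
    return torch.clamp(w + noise, (-qmax - 1) * scale, qmax * scale)

# Usage: gradients flow to w in both cases despite the non-differentiable rounding.
w = torch.randn(8, 8, requires_grad=True)
scale = torch.tensor(0.1)
fake_quant_ste(w, scale).pow(2).sum().backward()
```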
