Accelerating Transducers through Adjacent Token Merging

Recent end-to-end automatic speech recognition (ASR) systems often use a Transformer-based acoustic encoder that generates embeddings at a high frame rate. This design is inefficient, particularly for long speech signals, because the computational cost of self-attention grows quadratically with the sequence length. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens whose key values have high similarity scores. In this way, the total number of time steps is reduced, and inference of both the encoder and the joint network is accelerated. Experiments on LibriSpeech show that our method reduces the number of tokens by 57% and improves GPU inference speed by 70% without any notable loss of accuracy. Furthermore, we demonstrate that A-ToMe is also an effective way to reduce tokens in long-form ASR, where the input speech consists of multiple utterances.
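To make the idea concrete, the following minimal PyTorch sketch illustrates one plausible way to merge adjacent encoder frames based on the cosine similarity of their self-attention keys. The function name, the greedy selection of non-overlapping pairs, and the averaging of merged frames are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def adjacent_token_merge(x, keys, merge_ratio=0.5):
    """Illustrative sketch of adjacent token merging (hypothetical helper).

    x:    (B, T, D) encoder embeddings
    keys: (B, T, D) self-attention keys used to score adjacent similarity
    merge_ratio: fraction of adjacent pairs to merge
    """
    B, T, D = x.shape
    # Cosine similarity between each frame and its right neighbor.
    k = F.normalize(keys, dim=-1)
    sim = (k[:, :-1] * k[:, 1:]).sum(-1)          # (B, T-1)

    n_merge = int((T // 2) * merge_ratio)         # number of pairs to merge
    merged = []
    for b in range(B):
        # Greedily pick the most similar non-overlapping adjacent pairs.
        order = torch.argsort(sim[b], descending=True)
        taken, pairs = set(), []
        for i in order.tolist():
            if len(pairs) == n_merge:
                break
            if i in taken or i + 1 in taken:
                continue
            pairs.append(i)
            taken.update((i, i + 1))
        pairs = set(pairs)

        # Rebuild the sequence, averaging each selected pair of frames.
        out, t = [], 0
        while t < T:
            if t in pairs:
                out.append((x[b, t] + x[b, t + 1]) / 2)
                t += 2
            else:
                out.append(x[b, t])
                t += 1
        merged.append(torch.stack(out))           # (T_b', D), T_b' <= T
    return merged  # ragged list: merged length may differ per utterance
```

Because merged sequences are shorter, both the remaining encoder layers and the transducer joint network operate over fewer time steps, which is where the speedup comes from.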
