Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation

The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging: memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample, and we present optimizations that increase the efficiency and parallelism of this sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage and runs at a speed competitive with the default batched computation. As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024 and an audio length of 40 seconds using only 6 GB of memory.
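
The sample-wise idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact implementation: the `joiner` callable and the tensor names (`enc_out`, `pred_out`, `enc_lens`, `target_lens`) are hypothetical placeholders for a typical transducer setup, and torchaudio's `rnnt_loss` stands in for the transducer loss. Each sample's joint tensor is built and backpropagated individually, so only one small joint is live at a time instead of the full batched one.

```python
import torch
import torchaudio.functional as TAF

def samplewise_rnnt_loss(joiner, enc_out, pred_out, targets,
                         enc_lens, target_lens, blank=0):
    """Sketch: transducer loss and gradients computed sample by sample.

    Assumed shapes (placeholders, not the paper's code):
      enc_out:  (B, T_max, H)  encoder outputs
      pred_out: (B, U_max + 1, H)  prediction-network outputs
      targets:  (B, U_max)  padded label sequences
    """
    # Detach so each per-sample backward stops at the network outputs;
    # the accumulated gradients are pushed through the shared encoder
    # and prediction network in one backward pass at the end.
    enc = enc_out.detach().requires_grad_(True)
    pred = pred_out.detach().requires_grad_(True)
    B = enc.size(0)
    total = enc_out.new_zeros(())
    for i in range(B):
        T_i, U_i = int(enc_lens[i]), int(target_lens[i])
        # Trim padding: this sample's joint is only (1, T_i, U_i + 1, V)
        # rather than a slice of the full (B, T_max, U_max + 1, V) tensor.
        logits = joiner(enc[i:i + 1, :T_i], pred[i:i + 1, :U_i + 1])
        loss = TAF.rnnt_loss(logits,
                             targets[i:i + 1, :U_i].int(),
                             enc_lens[i:i + 1].int(),
                             target_lens[i:i + 1].int(),
                             blank=blank, reduction="sum") / B
        # Backpropagate immediately so this sample's joint tensor and
        # its activations are freed before the next iteration.
        loss.backward()
        total += loss.detach()
    # Single backward through the shared networks with accumulated grads.
    torch.autograd.backward([enc_out, pred_out], [enc.grad, pred.grad])
    return total
```

Besides keeping only one joint tensor alive at a time, trimming each sample to its true lengths T_i and U_i removes the padding that a batched joint would otherwise carry, which is where much of the memory saving comes from; the two-stage backward keeps the gradient identical to the batched mean loss.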
