Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an additional embedding table that embeds the previous n-gram token sequence rather than a single token. This allows the embedding table to be scaled up arbitrarily, with a commensurate increase in performance, without changing the token vocabulary. Since embeddings are sparsely retrieved from the table via a lookup, increasing the size of the table adds neither extra operations to each forward pass nor extra parameters that need to be stored in limited GPU/TPU memory. We explore scaling n-gram embedding tables up to nearly a billion parameters. When trained on a 3-billion sentence corpus, we find that LookupLM improves long tail log perplexity by 2.44 and long tail WER by 23.4% on a downstream speech recognition task over a standard RNN language model baseline, an improvement comparable to scaling up the baseline by 6.2x in floating point operations.
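
To illustrate the core idea, the following is a minimal sketch of an n-gram lookup embedding: in addition to the usual single-token embedding, the previous n tokens are mapped to a row of a second, much larger table, and each forward step touches only one row of each table. The table sizes, the hashing scheme, and the additive combination of the two embeddings are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

VOCAB_SIZE = 4096        # token vocabulary (unchanged by scaling the n-gram table)
EMBED_DIM = 64
NGRAM_ROWS = 100_000     # kept small here; the paper scales this toward ~1B parameters
N = 3                    # order of the previous-token n-gram

rng = np.random.default_rng(0)
token_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype(np.float32)
ngram_table = rng.normal(size=(NGRAM_ROWS, EMBED_DIM)).astype(np.float32)


def ngram_row(prev_tokens):
    """Map the previous n-gram to a table row (hypothetical hashing scheme)."""
    key = 0
    for t in prev_tokens[-N:]:
        key = key * VOCAB_SIZE + int(t)
    return key % NGRAM_ROWS


def embed_step(current_token, prev_tokens):
    """One embedding step: a single-token lookup plus a single n-gram lookup.

    Each step retrieves exactly one row from each table, so growing
    NGRAM_ROWS adds no floating point operations to the forward pass.
    """
    tok_emb = token_table[current_token]
    ngram_emb = ngram_table[ngram_row(prev_tokens)]
    return tok_emb + ngram_emb  # combined input to the recurrent layers


# Example: embed token 42 given the preceding context [7, 13, 42]
x = embed_step(42, [7, 13, 42])
```

Because only the retrieved rows participate in computation, the n-gram table can live in host memory or be sharded across devices without affecting per-step cost.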
