Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

End-to-end multilingual ASR has become increasingly appealing for several reasons: it simplifies training and deployment, and it enables positive performance transfer from high-resource to low-resource languages. However, scaling up the number of languages, total hours of speech, and number of unique tokens is not a trivial task. This paper explores large-scale multilingual ASR models on 70 languages. We inspect two architectures: (1) a shared embedding and output model and (2) a multiple embedding and output model. In the shared-model experiments, we show the importance of the tokenization strategy across different languages. We then use the best tokenization strategy to train a multiple embedding and output model, which further improves our results. Our multilingual ASR achieves a 13.9%-15.6% average relative WER improvement over monolingual models. We show that our multilingual ASR generalizes well to an unseen dataset and domain, achieving 9.5% and 7.5% WER on Multilingual LibriSpeech (MLS) in the zero-shot and fine-tuning settings, respectively.
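Although the paper's implementation is not reproduced here, the architectural contrast is easy to make concrete. The following minimal PyTorch sketch shows the decoder-side difference between the two variants under assumed names and dimensions (SharedHeadASR, MultiHeadASR, the 512-dim model size, and a crude mean-pooled stand-in for the transducer predictor/joiner are all illustrative, not taken from the paper):

import torch
import torch.nn as nn

class SharedHeadASR(nn.Module):
    # (1) Shared model: one token-embedding table and one output projection
    # over a single joint subword vocabulary covering every language.
    def __init__(self, encoder: nn.Module, vocab_size: int, dim: int = 512):
        super().__init__()
        self.encoder = encoder                      # shared acoustic encoder
        self.embed = nn.Embedding(vocab_size, dim)  # shared input embedding
        self.out = nn.Linear(dim, vocab_size)       # shared output projection

    def forward(self, feats: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        enc = self.encoder(feats)                            # (B, T, dim)
        dec = self.embed(prev_tokens).mean(1, keepdim=True)  # placeholder for the predictor/joiner
        return self.out(enc + dec)                           # logits over the joint vocabulary

class MultiHeadASR(nn.Module):
    # (2) Multiple embedding/output model: the encoder stays shared, but each
    # language (or language group) owns its own embedding/output pair,
    # selected by a language identifier at run time.
    def __init__(self, encoder: nn.Module, vocab_sizes: dict, dim: int = 512):
        super().__init__()
        self.encoder = encoder
        self.embeds = nn.ModuleDict({k: nn.Embedding(v, dim) for k, v in vocab_sizes.items()})
        self.outs = nn.ModuleDict({k: nn.Linear(dim, v) for k, v in vocab_sizes.items()})

    def forward(self, feats: torch.Tensor, prev_tokens: torch.Tensor, lang: str) -> torch.Tensor:
        enc = self.encoder(feats)
        dec = self.embeds[lang](prev_tokens).mean(1, keepdim=True)
        return self.outs[lang](enc + dec)                    # logits over this language's vocabulary

For example, MultiHeadASR(nn.Linear(80, 512), {"en": 8000, "th": 4000}) would give English and Thai their own 8k- and 4k-token heads while sharing all encoder parameters. The joint vocabulary for the shared variant would typically be built with a subword tokenizer such as SentencePiece, which is where the per-language tokenization choices studied in the paper come into play.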
