ASR2K: Speech Recognition for Around 2000 Languages without Audio

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining our pipeline with Crúbadán, a large endangered-languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crúbadán statistics alone, and improve these to 45% CER and 69% WER when using 10,000 raw text utterances.
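To make the division of labor among the three components concrete, the sketch below shows one possible instantiation in Python, under assumptions not taken from the paper: Allosaurus stands in as the pretrained multilingual phone recognizer (acoustic model), Epitran as the grapheme-to-phoneme converter (pronunciation model), and a unigram count model over raw text as the language model. The greedy word-matching "decoder" and the word_score heuristic are illustrative simplifications for a single-word utterance, not the authors' decoding method; the language code 'eng-Latn' and the file 'utterance.wav' are placeholders.

# Minimal pipeline sketch, assuming the Allosaurus and Epitran libraries.
from collections import Counter

import epitran                               # pronunciation model (G2P)
from allosaurus.app import read_recognizer   # multilingual acoustic model

# 1) Acoustic model: a pretrained multilingual phone recognizer; no
#    target-language audio or supervision is required.
acoustic_model = read_recognizer()
phones = acoustic_model.recognize("utterance.wav")   # e.g. "h ə l oʊ"

# 2) Pronunciation model: map the raw-text vocabulary to phone strings.
#    'eng-Latn' is a placeholder for the target language's Epitran code.
g2p = epitran.Epitran("eng-Latn")
raw_text = ["hello world", "hello there"]            # target-language text
vocab = {w for line in raw_text for w in line.split()}
lexicon = {w: g2p.transliterate(w) for w in vocab}

# 3) Language model: unigram counts from the raw text (the paper builds
#    its LM from raw text or n-gram statistics such as Crúbadán's).
unigram = Counter(w for line in raw_text for w in line.split())
total = sum(unigram.values())

def word_score(word, hyp_phones):
    """Crude pronunciation overlap combined with the unigram LM prior."""
    pron = lexicon[word].replace(" ", "")
    hyp = hyp_phones.replace(" ", "")
    overlap = len(set(pron) & set(hyp)) / max(len(set(pron)), 1)
    return overlap + unigram[word] / total

# Greedy "decode": pick the vocabulary word best matching the phones.
best = max(vocab, key=lambda w: word_score(w, phones))
print(best)

A full system would compose these three components into a proper search (for example, over a weighted lattice of phone hypotheses) rather than matching words greedily, but the roles of the acoustic, pronunciation, and language models are the same as in this sketch.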
