ASR2K: Speech Recognition for Around 2000 Languages without Audio

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining our pipeline with Crúbadán, a large endangered-languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crúbadán statistics alone, and improve these to 45% CER and 69% WER when using 10,000 raw text utterances.
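To make the division of labor among the three components concrete, the sketch below shows one possible instantiation in Python, under assumptions not taken from the paper: Allosaurus stands in as the pretrained multilingual phone recognizer (acoustic model), Epitran as the grapheme-to-phoneme converter (pronunciation model), and a unigram count model over raw text as the language model. The greedy word-matching "decoder" and the word_score heuristic are illustrative simplifications for a single-word utterance, not the authors' decoding method; the language code 'eng-Latn' and the file 'utterance.wav' are placeholders.

# Minimal pipeline sketch, assuming the Allosaurus and Epitran libraries.
from collections import Counter

import epitran                               # pronunciation model (G2P)
from allosaurus.app import read_recognizer   # multilingual acoustic model

# 1) Acoustic model: a pretrained multilingual phone recognizer; no
#    target-language audio or supervision is required.
acoustic_model = read_recognizer()
phones = acoustic_model.recognize("utterance.wav")   # e.g. "h ə l oʊ"

# 2) Pronunciation model: map the raw-text vocabulary to phone strings.
#    'eng-Latn' is a placeholder for the target language's Epitran code.
g2p = epitran.Epitran("eng-Latn")
raw_text = ["hello world", "hello there"]            # target-language text
vocab = {w for line in raw_text for w in line.split()}
lexicon = {w: g2p.transliterate(w) for w in vocab}

# 3) Language model: unigram counts from the raw text (the paper builds
#    its LM from raw text or n-gram statistics such as Crúbadán's).
unigram = Counter(w for line in raw_text for w in line.split())
total = sum(unigram.values())

def word_score(word, hyp_phones):
    """Crude pronunciation overlap combined with the unigram LM prior."""
    pron = lexicon[word].replace(" ", "")
    hyp = hyp_phones.replace(" ", "")
    overlap = len(set(pron) & set(hyp)) / max(len(set(pron)), 1)
    return overlap + unigram[word] / total

# Greedy "decode": pick the vocabulary word best matching the phones.
best = max(vocab, key=lambda w: word_score(w, phones))
print(best)

A full system would compose these three components into a proper search (for example, over a weighted lattice of phone hypotheses) rather than matching words greedily, but the roles of the acoustic, pronunciation, and language models are the same as in this sketch.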
