ASR4REAL: An extended benchmark for speech models

Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We find that, even though recent models do not seem to exhibit a gender bias, they usually show substantial performance discrepancies by accent, and even larger ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech. In this context, even a language model trained on a dataset as big as Common Crawl does not seem to have a significant positive effect, which reiterates the importance of developing conversational language models.
