ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition

Speech recognition applications cover a range of different audio and text distributions, with different speaking styles, background noise, transcription punctuation and character casing. However, many speech recognition systems require dataset-specific tuning (audio filtering, punctuation removal and casing normalisation), thereby assuming a priori knowledge of both the audio and text distributions. This tuning requirement can lead to systems failing to generalise to other datasets and domains. To promote the development of multi-domain speech systems, we introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. Benchmarked systems must use the same data pre- and post-processing algorithm across datasets, assuming the audio and text data distributions are a priori unknown. We compare a series of state-of-the-art (SoTA) end-to-end (E2E) systems on this benchmark, demonstrating how a single speech system can be applied to, and evaluated on, a wide range of data distributions. We find E2E systems to be effective across datasets: in a fair comparison, E2E systems achieve within 2.6% of SoTA systems tuned to a specific dataset. Our analysis reveals that transcription artefacts, such as punctuation and casing, pose difficulties for ASR systems and should be included in evaluation. We believe E2E benchmarking over a range of datasets promotes the research of multi-domain speech recognition systems. ESB is available at https://huggingface.co/esb.
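To make concrete the kind of dataset-specific text tuning the abstract refers to, the sketch below shows a hypothetical normalisation routine (lower-casing, punctuation stripping, whitespace collapsing) of the sort that assumes prior knowledge of the target text distribution; it is an illustrative example written for this summary, not a procedure taken from the paper or the ESB codebase.

```python
import re
import string


def naive_normalise(transcript: str) -> str:
    """Hypothetical dataset-specific normaliser: lower-cases the text and
    strips all punctuation, destroying exactly the transcription artefacts
    (casing, punctuation) that ESB argues should be kept in evaluation."""
    lowered = transcript.lower()
    # Remove every ASCII punctuation character, including apostrophes.
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse repeated whitespace left behind by the removal.
    return re.sub(r"\s+", " ", stripped).strip()


# Example: formatted transcript vs. its normalised form.
print(naive_normalise("Hello, World!  It's 5 p.m."))  # → hello world its 5 pm
```

A benchmark entry that scores well only after applying such a routine has effectively been tuned to one text distribution, which is the failure mode ESB is designed to expose.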
