ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition

Speech recognition applications cover a range of different audio and text distributions, with different speaking styles, background noise, transcription punctuation and character casing. However, many speech recognition systems require dataset-specific tuning (audio filtering, punctuation removal and casing normalisation), thereby assuming a priori knowledge of both the audio and text distributions. This tuning requirement can lead to systems failing to generalise to other datasets and domains. To promote the development of multi-domain speech systems, we introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. Benchmarked systems must use the same data pre- and post-processing algorithm across datasets, assuming the audio and text data distributions are a priori unknown. We compare a series of state-of-the-art (SoTA) end-to-end (E2E) systems on this benchmark, demonstrating how a single speech system can be applied to, and evaluated on, a wide range of data distributions. We find E2E systems to be effective across datasets: in a fair comparison, E2E systems achieve within 2.6% of SoTA systems tuned to a specific dataset. Our analysis reveals that transcription artefacts, such as punctuation and casing, pose difficulties for ASR systems and should be included in evaluation. We believe E2E benchmarking over a range of datasets promotes the research of multi-domain speech recognition systems. ESB is available at https://huggingface.co/esb.
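To make concrete the kind of dataset-specific text tuning the abstract refers to, the sketch below shows a hypothetical normalisation routine (lower-casing, punctuation stripping, whitespace collapsing) of the sort that assumes prior knowledge of the target text distribution; it is an illustrative example written for this summary, not a procedure taken from the paper or the ESB codebase.

```python
import re
import string


def naive_normalise(transcript: str) -> str:
    """Hypothetical dataset-specific normaliser: lower-cases the text and
    strips all punctuation, destroying exactly the transcription artefacts
    (casing, punctuation) that ESB argues should be kept in evaluation."""
    lowered = transcript.lower()
    # Remove every ASCII punctuation character, including apostrophes.
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse repeated whitespace left behind by the removal.
    return re.sub(r"\s+", " ", stripped).strip()


# Example: formatted transcript vs. its normalised form.
print(naive_normalise("Hello, World!  It's 5 p.m."))  # → hello world its 5 pm
```

A benchmark entry that scores well only after applying such a routine has effectively been tuned to one text distribution, which is the failure mode ESB is designed to expose.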
