Word Error Rate Estimation Without ASR Output: e-WER2

Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort to estimate WER using acoustic, lexical and phonotactic features. Our novel approach estimates WER using a multistream end-to-end architecture. We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and systems without any access to the ASR system (no-box). The no-box system learns a joint acoustic-lexical representation from phoneme recognition results along with MFCC acoustic features to estimate WER. Considering WER per sentence, our no-box system achieves 0.56 Pearson correlation with the reference evaluation and 0.24 root mean square error (RMSE) across 1,400 sentences. The overall WER estimated by e-WER2 is 30.9% for a three-hour test set, while the WER computed using the reference transcriptions is 28.5%.
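For concreteness, the quantities the abstract reports can be sketched in plain Python: per-sentence WER as word-level edit distance normalized by reference length, plus the two evaluation metrics used to compare estimated against reference WER (Pearson correlation and RMSE). This is an illustrative sketch of the standard definitions, not the paper's e-WER2 model; all function names are ours.

```python
import math

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row dynamic-programming table for edit distance.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1]
        for j in range(1, len(hyp) + 1):
            cur = min(d[j] + 1,              # deletion
                      d[j - 1] + 1,          # insertion
                      prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

def pearson(x, y) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y) -> float:
    """Root mean square error between estimated and reference WER."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

In the no-box setting of the paper, `wer` would only be computable on a held-out labelled set; the estimator is trained to predict it, and `pearson` / `rmse` quantify how close the per-sentence predictions come to the reference values.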
