Word Error Rate Estimation Without ASR Output: e-WER2

Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort to estimate WER using acoustic, lexical and phonotactic features. Our novel approach estimates WER using a multistream end-to-end architecture. We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and systems without any access to the ASR system (no-box). The no-box system learns a joint acoustic-lexical representation from phoneme recognition results along with MFCC acoustic features to estimate WER. Considering WER per sentence, our no-box system achieves 0.56 Pearson correlation with the reference evaluation and 0.24 root mean square error (RMSE) across 1,400 sentences. The overall WER estimated by e-WER2 is 30.9% for a three-hour test set, while the WER computed using the reference transcriptions is 28.5%.
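For concreteness, the quantities the abstract reports can be sketched in plain Python: per-sentence WER as word-level edit distance normalized by reference length, plus the two evaluation metrics used to compare estimated against reference WER (Pearson correlation and RMSE). This is an illustrative sketch of the standard definitions, not the paper's e-WER2 model; all function names are ours.

```python
import math

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row dynamic-programming table for edit distance.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1]
        for j in range(1, len(hyp) + 1):
            cur = min(d[j] + 1,              # deletion
                      d[j - 1] + 1,          # insertion
                      prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

def pearson(x, y) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y) -> float:
    """Root mean square error between estimated and reference WER."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

In the no-box setting of the paper, `wer` would only be computable on a held-out labelled set; the estimator is trained to predict it, and `pearson` / `rmse` quantify how close the per-sentence predictions come to the reference values.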
