Utterance-Level Neural Confidence Measure for End-to-End Children Speech Recognition

A confidence measure is a performance index of particular importance for automatic speech recognition (ASR) systems deployed in real-world scenarios. The present study investigates utterance-level neural confidence measures (NCM) in end-to-end automatic speech recognition (E2E ASR). The E2E system adopts the joint CTC-attention Transformer architecture. NCM prediction is formulated as a binary classification task, i.e., accepting or rejecting the input utterance, based on a set of predictor features acquired during the ASR decoding process. The investigation focuses on evaluating and comparing the efficacies of predictor features derived from different internal and external modules of the E2E system. Experiments are carried out on children's speech, for which state-of-the-art ASR systems show less than satisfactory performance and a robust confidence measure is particularly useful. It is noted that predictor features related to the acoustic information of speech play a more important role in estimating the confidence measure than those related to linguistic information. N-best score features perform significantly better than single-best ones. It is also shown that the EER and AUC metrics are not appropriate for evaluating the NCM of a mismatched ASR system with a significant performance gap.
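As a minimal illustration of the evaluation setup described above, the sketch below computes AUC and EER for an utterance-level accept/reject classifier from confidence scores and ground-truth accept labels. The scores and labels are hypothetical toy data, not from the paper; the ROC-based computation itself is standard.

```python
# Illustrative sketch: scoring an utterance-level confidence measure (NCM)
# with AUC and EER. Scores and labels below are hypothetical toy data.

def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the accept threshold."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)            # utterances that should be accepted
    neg = len(labels) - pos      # utterances that should be rejected
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def eer(points):
    """Approximate equal error rate: the operating point where FPR = FNR."""
    best = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (best[0] + (1 - best[1])) / 2

# Hypothetical NCM scores (higher = more confident) and accept labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
pts = roc_points(scores, labels)
print(f"AUC = {auc(pts):.3f}, EER = {eer(pts):.3f}")
# → AUC = 0.938, EER = 0.250
```

A higher AUC and lower EER indicate better separation between utterances that should be accepted and rejected; as the abstract notes, these threshold-sweeping metrics can be misleading when the underlying ASR system is badly mismatched to the test condition.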
