WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal Classification Paradigm

Automatic Speech Recognition (ASR) systems are evaluated using Word Error Rate (WER), which is calculated by comparing the number of errors between the ground truth and the ASR system's transcription. This calculation, however, requires a manual transcription of the speech signal to obtain the ground truth. Since transcribing audio signals is a costly process, Automatic WER Estimation (e-WER) methods have been developed, which attempt to predict the WER of an ASR system by relying only on the transcription and features of the speech signal. While WER is a continuous variable, prior work has shown that posing e-WER as a classification problem is more effective than regression. However, when cast as classification, these approaches suffer from heavy class imbalance. In this paper, we propose a new balanced paradigm for e-WER in a classification setting. Within this paradigm, we also propose WER-BERT, a BERT-based architecture with speech features for e-WER. Furthermore, we introduce a distance loss function to tackle the ordinal nature of e-WER classification. The proposed approach and paradigm are evaluated on the LibriSpeech dataset and a commercial (black-box) ASR system, Google Cloud's Speech-to-Text API. The results and experiments demonstrate that WER-BERT establishes a new state of the art in automatic WER estimation.
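For illustration, below is a minimal sketch of the WER computation the abstract refers to, using a standard Levenshtein (edit-distance) alignment between the reference and hypothesis word sequences. This is a generic reference implementation, not the evaluation code used in the paper.

```python
# Illustrative sketch only: standard Levenshtein-distance WER.
# WER = (substitutions + deletions + insertions) / number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1      # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,                   # deletion
                          d[i][j - 1] + 1,                   # insertion
                          d[i - 1][j - 1] + cost)            # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # "on the" is dropped: 2 errors over 6 reference words -> WER ~ 0.33
    print(wer("the cat sat on the mat", "the cat sat mat"))
```

The abstract does not spell out the distance loss, so the following is a hypothetical sketch of one common way to make a classification loss aware of the ordinal structure of WER buckets: standard cross-entropy plus a penalty proportional to the expected absolute gap between the predicted and true bucket index. The `alpha` weight and the exact form of the penalty are assumptions here; the paper's formulation may differ.

```python
# Hypothetical sketch of a distance-aware loss for ordinal WER classification,
# assuming WER is discretized into K ordered buckets. NOT the paper's exact loss.
import torch
import torch.nn.functional as F

def distance_aware_loss(logits: torch.Tensor, targets: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """logits: (batch, K) class scores; targets: (batch,) true bucket indices."""
    K = logits.size(-1)
    ce = F.cross_entropy(logits, targets)            # standard classification term
    probs = F.softmax(logits, dim=-1)                # (batch, K)
    buckets = torch.arange(K, dtype=probs.dtype, device=probs.device)
    # expected absolute distance between the predicted and the true bucket
    gaps = (buckets.unsqueeze(0) - targets.unsqueeze(1).to(probs.dtype)).abs()
    dist = (probs * gaps).sum(dim=-1)
    return ce + alpha * dist.mean()
```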
