Improving Deep Models of Speech Quality Prediction through Voice Activity Detection and Entropy-based Measures

This paper explores Deep machine listening for Estimating Speech Quality (DESQ), which predicts perceived speech quality from phoneme posterior probabilities obtained from a deep neural network. The degradation of phonemes is quantified with the entropy-based Gini measure, which is compared to the previously proposed mean temporal distance (MTD). Since long speech pauses might have a large effect on speech quality, we investigate whether voice activity detection (VAD) has a beneficial or detrimental effect on the predictive power of our model. The evaluation correlates the model output with mean opinion scores (MOS) of normal-hearing listeners who rated signals degraded by typical VoIP artifacts. The Gini-based measure and MTD yield very similar predictions, with a lower computational cost for the Gini measure. Adding the VAD increases the correlation from r = 0.87 to r = 0.91, which is higher than that of three competing baselines (ITU-T P.563, ANIQUE+, and SRM-Rnorm).
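The pipeline described above can be illustrated with a minimal Python sketch. It assumes that the Gini measure is the standard Gini impurity, 1 - sum_k p_k^2, computed per frame on the phoneme posteriors and averaged over the utterance; the abstract does not give the exact definition used in DESQ. The function names, the boolean VAD mask, and the toy values are hypothetical and serve illustration only, and the sketch does not reproduce the reported correlations.

```python
from math import sqrt


def gini_impurity(frame):
    """Gini impurity 1 - sum_k p_k^2 of one frame of phoneme posteriors.

    Assumption: this is the standard impurity definition, not necessarily
    the exact measure used in the paper.
    """
    return 1.0 - sum(p * p for p in frame)


def mean_gini(posteriors, speech_mask=None):
    """Average the per-frame Gini impurity over an utterance.

    posteriors  : list of frames, each a list of probabilities summing to 1
    speech_mask : optional list of booleans (hypothetical VAD output);
                  frames marked False are treated as pauses and skipped
    """
    if speech_mask is None:
        speech_mask = [True] * len(posteriors)
    kept = [gini_impurity(f) for f, keep in zip(posteriors, speech_mask) if keep]
    if not kept:
        raise ValueError("no frames left after applying the VAD mask")
    return sum(kept) / len(kept)


def pearson_r(x, y):
    """Pearson correlation between model outputs x and MOS values y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy utterance: one confident frame and one maximally uncertain frame.
toy_posteriors = [[1.0, 0.0], [0.5, 0.5]]
score_all_frames = mean_gini(toy_posteriors)
score_speech_only = mean_gini(toy_posteriors, speech_mask=[True, False])
```

A confident posterior (one phoneme near probability 1) gives an impurity near 0, while a flat posterior gives a high impurity, so a larger utterance-level average indicates stronger phoneme degradation. Masking frames before averaging is one simple way to keep pauses from dominating the score; how the paper combines VAD and the measure is not specified in the abstract.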
