Paper information - Improving Deep Models of Speech Quality Prediction through Voice Activity Detection and Entropy-based Measures

This paper explores Deep machine listening for Estimating Speech Quality (DESQ), which predicts perceived speech quality from phoneme posterior probabilities obtained from a deep neural network. The degradation of phonemes is quantified with the entropy-based Gini measure, which is compared to the previously proposed mean temporal distance (MTD). Since long speech pauses might have a large effect on the speech quality, we investigate whether voice activity detection (VAD) has a beneficial or detrimental effect on the predictive power of our model. The evaluation is performed by correlating the model output with mean opinion scores (MOS) from normal-hearing listeners who rated signals degraded by typical VoIP artifacts. While the Gini-based measure and MTD result in very similar predictions (with a lower computational cost for the Gini measure), the VAD increases the correlation from r = 0.87 to r = 0.91, which is higher than that of three competing baselines (ITU-T P.563, ANIQUE+, and SRM-Rnorm).
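The abstract does not give the exact formulas, so the following is only a minimal sketch of the general idea, under stated assumptions. It takes the "Gini measure" to be the Gini impurity 1 - sum_k p_k^2 of each frame's phoneme posterior vector, averages it over the frames that a VAD marked as speech, and evaluates predictions with the Pearson correlation against MOS. The function names, the boolean `speech_mask` input, and the plain averaging step are hypothetical and are not taken from the paper.

```python
import numpy as np

def gini_impurity(posteriors):
    """Per-frame Gini impurity for a (frames, phonemes) posterior matrix.

    Assumption: impurity is defined as 1 - sum_k p_k^2 for each frame.
    A peaked posterior gives a value near 0, a flat one a larger value.
    """
    p = np.asarray(posteriors, dtype=float)
    return 1.0 - np.sum(p ** 2, axis=1)

def utterance_score(posteriors, speech_mask=None):
    """Average the per-frame impurity, optionally over speech frames only.

    speech_mask is a hypothetical boolean array (True = speech) that
    stands in for the output of any VAD.
    """
    g = gini_impurity(posteriors)
    if speech_mask is not None:
        mask = np.asarray(speech_mask, dtype=bool)
        if mask.any():  # fall back to all frames if no speech was found
            g = g[mask]
    return float(np.mean(g))

def pearson_r(model_scores, mos):
    """Pearson correlation between model outputs and listener MOS."""
    return float(np.corrcoef(model_scores, mos)[0, 1])
```

Under these assumptions a higher score indicates flatter, more degraded posteriors, so the raw scores would be expected to correlate negatively with MOS unless they are mapped to the MOS scale first.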

Bernd T. Meyer | Jasper Ooster

[1] Tiago H. Falk, et al. A Non-Intrusive Quality and Intelligibility Measure of Reverberant and Dereverberated Speech, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[2] Doh-Suk Kim, et al. ANIQUE+: A new American national standard for non-intrusive estimation of narrowband speech quality, 2007, Bell Labs Technical Journal.

[3] Hynek Hermansky, et al. Performance monitoring for automatic speech recognition in noisy multi-channel environments, 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[4] Hynek Hermansky, et al. Novel neural network based fusion for multistream ASR, 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] James M. Kates, et al. Objective Quality and Intelligibility Prediction for Users of Assistive Listening Devices: Advantages and limitations of existing tools, 2015, IEEE Signal Processing Magazine.

[6] David Pearce, et al. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, 2000, INTERSPEECH.

[7] Sebastian Möller, et al. Speech Quality Estimation: Models and Trends, 2011, IEEE Signal Processing Magazine.

[8] B. Meyer, et al. Single-ended prediction of listening effort using deep neural networks, 2017, Hearing Research.

[9] Hynek Hermansky, et al. Mean temporal distance: Predicting ASR error from temporal properties of speech signal, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Leo Breiman, et al. Classification and Regression Trees, 1984.

[11] Bernd T. Meyer, et al. Single-Ended Speech Quality Prediction Based on Automatic Speech Recognition, 2018.

[12] Bernd T. Meyer, et al. Prediction of Perceived Speech Quality Using Deep Machine Listening, 2018, INTERSPEECH.

[13] Andrew Hines, et al. TCD-VoIP, a research database of degraded speech for assessing quality in VoIP applications, 2015, 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX).