An objective measure for estimating MOS of synthesized speech

This paper proposes an average concatenative cost function as an objective measure of the naturalness of synthesized speech. All seven of its component costs can be derived directly from the input text and the scripts of the speech database. A formal Mean Opinion Score (MOS) experiment shows that the average concatenative cost and its seven components are all highly correlated with the subjectively obtained MOS. The correlation coefficient between the objective measure and the subjective measure is -0.872, meaning that a higher cost corresponds to a lower MOS. The mean error in MOS estimation for individual waveforms is 0.32, with an RMSE of 0.40. When estimating the overall MOS of a TTS system, the mean error is smaller than 0.05. With the proposed objective measure, it becomes practical to track naturalness performance regularly. The proposed cost function could also serve as a criterion for optimizing unit selection algorithms and speech database pruning.
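The evaluation pipeline described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the unweighted averaging of component costs, the linear cost-to-MOS mapping, and the reading of "mean error" as mean absolute error are all assumptions, and any numbers passed in would be hypothetical rather than data from the paper.

```python
import math
from statistics import mean


def average_cost(unit_costs):
    """Average concatenative cost of one utterance.

    unit_costs: one list of component costs per selected unit.
    Assumption: components are combined by an unweighted mean.
    """
    return mean(mean(components) for components in unit_costs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(costs, mos):
    """Least-squares line mos ~ slope * cost + intercept (assumed mapping)."""
    mx, my = mean(costs), mean(mos)
    slope = (sum((x - mx) * (y - my) for x, y in zip(costs, mos))
             / sum((x - mx) ** 2 for x in costs))
    return slope, my - slope * mx


def error_stats(predicted, observed):
    """Return (mean absolute error, RMSE) of the MOS estimates."""
    errors = [p - o for p, o in zip(predicted, observed)]
    mae = mean(abs(e) for e in errors)
    rmse = math.sqrt(mean(e * e for e in errors))
    return mae, rmse
```

Under these assumptions, one would compute `average_cost` per waveform, fit the line on waveforms with known MOS, and report `pearson` and `error_stats` on the estimates. A negative slope is expected, since a higher cost corresponds to a lower MOS.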
