A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks

A recurring problem in developing and tuning speech synthesis systems is the lack of a well-established method for automatically rating the quality of synthetic speech. This work seeks a new automated measure trained on the results of large-scale subjective evaluations involving many human listeners, namely the Blizzard Challenge. To exploit these data, we experiment with linear regression, feed-forward and convolutional neural network models, and combinations of them, to regress from synthetic speech to the perceptual scores obtained from listeners. The largest improvements were seen when combining stimulus- and system-level predictions.
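As a rough illustration of what combining stimulus- and system-level predictions could look like, the sketch below fits a least-squares regressor from per-stimulus features to listener scores, averages the stimulus-level predictions within each system, and blends the two. The abstract does not specify the combination rule, the features, or the network architectures, so the blending weight `alpha`, the helper names, and the use of plain linear regression here are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the regressor can learn an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear(features, scores):
    """Ordinary least-squares fit from per-stimulus features to scores."""
    weights, *_ = np.linalg.lstsq(_with_bias(features), scores, rcond=None)
    return weights


def predict_linear(weights, features):
    """Stimulus-level prediction: one score per utterance."""
    return _with_bias(features) @ weights


def system_means(values, system_ids):
    """System-level prediction: mean of stimulus-level values per system."""
    return {s: float(np.mean(values[system_ids == s]))
            for s in np.unique(system_ids)}


def combine(stim_pred, system_ids, alpha=0.5):
    """Blend each stimulus prediction with its system's mean prediction.

    alpha is a hypothetical mixing weight (assumption, not from the paper):
    alpha=1 keeps the stimulus-level prediction, alpha=0 uses only the
    system-level mean.
    """
    means = system_means(stim_pred, system_ids)
    sys_pred = np.array([means[s] for s in system_ids])
    return alpha * stim_pred + (1.0 - alpha) * sys_pred
```

With `alpha` between 0 and 1, each utterance's predicted score is pulled toward the average prediction for its system, which is one simple way to let system-level evidence smooth noisy per-stimulus estimates.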
