Further exploration of the possibilities and pitfalls of multidimensional scaling as a tool for the evaluation of the quality of synthesized speech

Multidimensional scaling (MDS) has been suggested as a useful tool for evaluating the quality of synthesized speech, but it has not yet been tested extensively in this area. In a series of experiments based on data from the Blizzard Challenge 2008, the relation between Weighted Euclidean Distance Scaling and Simple Euclidean Distance Scaling is investigated to understand how aggregating data affects the MDS configuration. These results are compared to those collected as mean opinion scores (MOS). The rankings correspond, and MOS can be predicted from an object’s position in the MDS-generated stimulus space. The main advantage of MDS over MOS is its diagnostic value: the dimensions along which stimuli vary are uncorrelated, which is not the case for the scales used in modular evaluation with MOS. Finally, an attempt is made to generalize from the MDS representations of the thoroughly tested subset to the aggregated data of the larger-scale Blizzard Challenge.
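The idea of predicting MOS from a stimulus position in an MDS space can be illustrated with a minimal sketch. It assumes classical (Torgerson) metric MDS on a single aggregated dissimilarity matrix, which is a simple stand-in and not necessarily the scaling procedure used in the experiments; it does not model the per-listener weights of Weighted Euclidean Distance Scaling. The function names `classical_mds` and `fit_mos` and all toy numbers are hypothetical and are not data from the Blizzard Challenge.

```python
import numpy as np


def classical_mds(dissim, n_dims=2):
    """Classical (Torgerson) MDS: coordinates whose Euclidean
    distances approximate the given dissimilarities."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    # Double-centre the squared dissimilarities to get an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the n_dims largest eigenvalues; clip tiny negatives from rounding.
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def fit_mos(coords, mos):
    """Least-squares linear fit of MOS on MDS coordinates plus an intercept."""
    design = np.column_stack([coords, np.ones(len(coords))])
    weights, *_ = np.linalg.lstsq(design, mos, rcond=None)
    return weights, design @ weights


# Toy illustration only: five hypothetical systems placed in a 2-D space,
# with MOS constructed as a linear function of that (hidden) position.
true_points = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 3.0]]
)
toy_dissim = np.linalg.norm(true_points[:, None] - true_points[None, :], axis=-1)
toy_mos = 1.0 + 0.8 * true_points[:, 0] + 0.3 * true_points[:, 1]

coords = classical_mds(toy_dissim, n_dims=2)
weights, predicted = fit_mos(coords, toy_mos)
```

Because MDS solutions are determined only up to rotation, reflection, and translation, the fitted weights depend on the orientation of the recovered axes; the predicted MOS values do not.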
