Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis

The quality of current commercial speech synthesis systems is now so high that improvements are being made at subtle sub- and supra-segmental levels. Human perceptual evaluation of such subtle improvements requires close perceptual attention to specific acoustic characteristics or cues. However, it is not well understood which acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may therefore be difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal while ignoring others that are perceptually more important to them. The aim of the current study was to determine which acoustic characteristics of unit-selection synthetic speech are most salient to listeners when they evaluate its naturalness. The study used multidimensional scaling techniques to analyse listeners' pairwise comparisons of synthetic speech sentences. Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness. These relative differences in importance will affect listeners' ability to attend to these different acoustic characteristics of synthetic speech, and should therefore be taken into account when designing methods of synthetic speech evaluation.
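As a rough illustration of the kind of analysis described, the sketch below applies classical (Torgerson) multidimensional scaling to a small dissimilarity matrix. The abstract does not state which MDS variant was used or how the pairwise comparisons were coded, so this is an assumption-laden example rather than the study's procedure: it assumes listener responses have already been aggregated into a symmetric matrix of dissimilarities between stimuli, and the function name `classical_mds` and the toy values are hypothetical.

```python
import numpy as np


def classical_mds(dissimilarity, n_dims=2):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix.

    Returns the coordinates and all eigenvalues (largest first), which
    indicate how much structure each dimension accounts for.
    """
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]

    # Double-centre the squared dissimilarities to obtain an
    # inner-product (Gram) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering

    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Clip small negative eigenvalues that arise from rounding or from
    # non-Euclidean dissimilarities before taking the square root.
    kept = np.clip(eigvals[:n_dims], 0.0, None)
    coords = eigvecs[:, :n_dims] * np.sqrt(kept)
    return coords, eigvals


# Hypothetical dissimilarities among four stimuli (not data from the study).
toy_dissimilarity = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])

coords, eigvals = classical_mds(toy_dissimilarity, n_dims=2)

# Distances between the recovered points, for comparison with the input.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

In an analysis of this kind, the recovered dimensions would then be interpreted by relating them to measured properties of the stimuli; the relative size of the eigenvalues gives one indication of how much each dimension contributes.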
