A perceptual evaluation of distance measures for concatenative speech synthesis

In concatenative synthesis, new utterances are created by concatenating segments (units) of recorded speech. When the segments are extracted from a large speech corpus, a key issue is selecting segments that will sound natural in a given phonetic context. Distance measures are often used for this task. However, little is known about the perceptual relevance of these measures. More insight into the relationship between computed distances and perceptual differences is needed to develop accurate unit selection algorithms and to improve the quality of the resulting synthetic speech. In this paper, we develop a perceptual test to measure subtle phonetic differences between speech units. We use the perceptual data to evaluate several popular distance measures. The results show that distance measures that use frequency warping perform better than those that do not, and that little additional advantage is gained by using weighted distances or delta features.
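To make the contrast between warped and unwarped distances concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a mel-style warping (one common formula among several), linear interpolation for resampling, and a Euclidean distance with an optional weight vector. The names `hz_to_mel`, `mel_to_hz`, `warp_to_mel`, and `euclidean`, and the parameters `nyquist_hz`, `n_points`, and `weights`, are hypothetical and chosen only for illustration.

```python
import math


def hz_to_mel(f_hz):
    # One common mel formula (an assumption; the abstract does not name a warping).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def warp_to_mel(spectrum, nyquist_hz, n_points):
    """Resample a linearly spaced spectrum (bin 0 = 0 Hz, last bin = Nyquist)
    at n_points positions equally spaced on the mel scale."""
    n_bins = len(spectrum)
    max_mel = hz_to_mel(nyquist_hz)
    warped = []
    for k in range(n_points):
        f_hz = mel_to_hz(max_mel * k / (n_points - 1))
        pos = f_hz / nyquist_hz * (n_bins - 1)  # fractional bin index
        lo = min(int(pos), n_bins - 2)          # keep lo + 1 in range
        frac = pos - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return warped


def euclidean(a, b, weights=None):
    """Euclidean distance; passing weights gives a weighted variant."""
    if weights is None:
        weights = [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Toy spectra: a flat reference plus an equal-sized bump at low or high frequency.
flat = [0.0] * 65
low_bump = [1.0 if 2 <= i <= 5 else 0.0 for i in range(65)]
high_bump = [1.0 if 58 <= i <= 61 else 0.0 for i in range(65)]

# On the linear frequency axis the two bumps are equally far from the reference.
d_lin_low = euclidean(flat, low_bump)
d_lin_high = euclidean(flat, high_bump)

# After warping, the low-frequency bump spans more sample points.
args = (8000.0, 65)
d_warp_low = euclidean(warp_to_mel(flat, *args), warp_to_mel(low_bump, *args))
d_warp_high = euclidean(warp_to_mel(flat, *args), warp_to_mel(high_bump, *args))
```

In this toy setup the unwarped distance treats the two bumps identically, while the warped distance gives more weight to the low-frequency difference, where the mel scale allocates more resolution. Whether that emphasis matches listener judgments is the kind of question the perceptual test in the paper is designed to answer.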
