We propose a speech model that describes the acoustic inventories of concatenative synthesizers. The model has the following characteristics: (i) its representations are very compact, so high compression ratios are possible, (ii) resynthesized speech is free of concatenation errors, (iii) the degree of articulation can be controlled explicitly, and (iv) voice transformation is feasible with relatively few additional recordings of a target speaker. The model represents a speech unit as a combination of several types of features, each computed by non-linear, asynchronous interpolation of neighboring basis vectors associated with known phonemic identities. During analysis, basis vectors and transition weights are estimated under a strict diphone assumption using a dynamic time warping approach. During synthesis, the estimated transition weight values are modified to produce changes in duration and articulation effort.
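To make the idea of asynchronous interpolation concrete, the following is a minimal sketch and not the model's actual implementation. It assumes that each feature type moves from a left basis vector to a right basis vector along its own sigmoid-shaped transition weight. The sigmoid shape, the names `transition_weight` and `interpolate_unit`, the `(midpoint, slope)` parameters, and the formant-like numbers are all hypothetical choices made for illustration. Estimating the weights with dynamic time warping is not shown.

```python
import math

def transition_weight(t, midpoint, slope):
    """Weight in (0, 1) at normalized time t in [0, 1].
    The sigmoid shape is an assumption made for this sketch."""
    return 1.0 / (1.0 + math.exp(-slope * (t - midpoint)))

def interpolate_unit(left, right, params, n_frames):
    """Interpolate each feature type between two basis vectors.

    left, right: dict mapping feature name -> basis vector (list of floats)
    params:      dict mapping feature name -> (midpoint, slope); each feature
                 has its own trajectory, which makes the interpolation
                 asynchronous across features
    n_frames:    number of output frames (must be at least 2)
    """
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # normalized time in [0, 1]
        frame = {}
        for name, (midpoint, slope) in params.items():
            w = transition_weight(t, midpoint, slope)
            frame[name] = [(1.0 - w) * a + w * b
                           for a, b in zip(left[name], right[name])]
        frames.append(frame)
    return frames

# Hypothetical one-dimensional features for two neighboring basis vectors.
left = {"f1": [700.0], "f2": [1200.0]}
right = {"f1": [300.0], "f2": [2300.0]}

# f1 crosses its midpoint earlier than f2 (illustrative values only).
params = {"f1": (0.4, 12.0), "f2": (0.6, 12.0)}

frames = interpolate_unit(left, right, params, n_frames=11)
```

In this sketch, changing `n_frames` stretches or compresses the same weight trajectories in time, and changing a feature's `slope` makes its transition sharper or more gradual. These stand in for the duration and articulation-effort modifications described above, but only as an assumed analogy.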