Speech segment network approach for optimization of synthesis unit set

Abstract This paper proposes a speech segment network approach for constructing a synthesis unit set that yields high-quality synthesized speech yet is small enough to be practical. The approach selects a synthesis unit set in which segmental and/or inter-segmental distortions are minimized, using combinatorial optimization methods such as iterative improvement and simulated annealing. Experiments with diphone segments show that the method can construct diphone unit sets in which the total or the maximum inter-segmental distortion is reduced by about 35% and 30%, respectively. The reduction rate increased as the segment candidate population grew. A listening test using speech synthesized with the selected diphone unit set also perceptually confirmed the effectiveness of this unit set design.
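The selection step described in the abstract can be illustrated with a minimal Python sketch of simulated annealing over candidate segments. This is not the authors' implementation: the abstract does not specify the distortion measure, the network structure, or the annealing schedule. The sketch assumes that each candidate segment is reduced to a hypothetical pair of scalar boundary features `(left_edge, right_edge)`, that inter-segmental distortion is the absolute mismatch between the right edge of one unit and the left edge of the next, and that the temperature follows a geometric cooling schedule. The names `total_distortion`, `anneal`, and `make_toy_problem` are illustrative only.

```python
import math
import random


def total_distortion(choice, candidates, links):
    """Sum of boundary mismatches over all concatenable unit pairs.

    choice[u] is the index of the candidate currently selected for unit u.
    Each candidate is an assumed (left_edge, right_edge) pair of scalars.
    """
    total = 0.0
    for a, b in links:
        right_edge = candidates[a][choice[a]][1]
        left_edge = candidates[b][choice[b]][0]
        total += abs(right_edge - left_edge)
    return total


def anneal(candidates, links, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Pick one candidate per unit by simulated annealing (assumed schedule)."""
    rng = random.Random(seed)
    choice = [0] * len(candidates)
    cost = total_distortion(choice, candidates, links)
    best, best_cost = list(choice), cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        unit = rng.randrange(len(candidates))
        old = choice[unit]
        choice[unit] = rng.randrange(len(candidates[unit]))
        new_cost = total_distortion(choice, candidates, links)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(choice), cost
        else:
            choice[unit] = old  # reject the move
    return best, best_cost


def make_toy_problem(n_units=6, n_candidates=8, seed=1):
    """Random toy data: every unit may be followed by every other unit."""
    rng = random.Random(seed)
    candidates = [
        [(rng.random(), rng.random()) for _ in range(n_candidates)]
        for _ in range(n_units)
    ]
    links = [(a, b) for a in range(n_units) for b in range(n_units) if a != b]
    return candidates, links


if __name__ == "__main__":
    candidates, links = make_toy_problem()
    initial = total_distortion([0] * len(candidates), candidates, links)
    best, best_cost = anneal(candidates, links)
    print(f"initial distortion: {initial:.3f}, after annealing: {best_cost:.3f}")
```

Accepting only moves with `delta <= 0` would turn the same loop into plain iterative improvement, and replacing the sum in `total_distortion` with a maximum would target the maximum rather than the total inter-segmental distortion.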
