Phone-based speech synthesis with neural network and articulatory control

The paper presents a novel method for synthesizing a speech signal using a phone-based concatenation approach. A neural network is employed to generalize the phone templates during synthesis. Simplified articulatory-space input parameters, based on a modified vowel diagram, provide flexible and effective articulatory control and enable the design of an articulatory control model for allophonic variations in the speech signal. The network approach is chosen for its non-linear mapping between the articulatory-space parameters and the spectral information of the speech signal; it also facilitates non-linear approximation of phone-template transitions. The phone templates of the synthesizer are stored implicitly as the parameters of a medium-sized network. The performance of this new speech synthesis technique is demonstrated with a prototype system specifically designed for Cantonese (a common Chinese dialect), and the synthetic speech quality is assessed by informal listening tests.
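The core idea, a network that maps low-dimensional articulatory-space coordinates non-linearly to spectral parameters so that intermediate points yield smoothly interpolated spectra, can be sketched as follows. This is a minimal illustrative example, not the paper's system: the radial-basis-function form, the 2-D vowel-diagram input, the toy training data, and all dimensions are assumptions chosen for brevity.

```python
import numpy as np

# Hedged sketch: an RBF network mapping 2-D articulatory coordinates
# (a simplified vowel diagram) to a spectral parameter vector.
# Training data, network size, and widths are illustrative only.

rng = np.random.default_rng(0)

# Toy "phone templates": articulatory input points and stand-in spectra.
X = rng.uniform(0.0, 1.0, size=(40, 2))   # 40 points in articulatory space
W_true = rng.normal(size=(2, 8))
Y = np.sin(X @ W_true)                    # stand-in spectral vectors (40 x 8)

# Gaussian RBF layer centred on a subset of the training points.
centres = X[:10]
width = 0.3

def rbf_features(points):
    """Gaussian kernel activations of each point w.r.t. every centre."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# Output weights by linear least squares (a stand-in for more elaborate
# schemes such as orthogonal-least-squares training of RBF networks).
Phi = rbf_features(X)
W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

# Querying an intermediate articulatory point yields an interpolated
# spectrum: the "templates" live implicitly in the network parameters.
query = np.array([[0.5, 0.5]])
spectrum = rbf_features(query) @ W
```

Because the phone templates are encoded only in `centres` and `W`, moving the query point through articulatory space produces the kind of smooth, non-linear transition between templates that the abstract describes.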
