A novel prosodic-information synthesizer based on recurrent fuzzy neural network for the Chinese TTS system

In this paper, a new technique for Chinese text-to-speech (TTS) systems is proposed. Our major effort focuses on the generation of prosodic information. New methodologies are developed for constructing fuzzy rules in a prosodic model that simulates human pronunciation rules. The proposed Recurrent Fuzzy Neural Network (RFNN) is a multilayer recurrent neural network (RNN) that integrates a Self-cOnstructing Neural Fuzzy Inference Network (SONFIN) into a recurrent connectionist structure. The RFNN can be functionally divided into two parts. The first part adopts the SONFIN as a prosodic model to explore the relationship between high-level linguistic features and prosodic information based on fuzzy inference rules. Compared with conventional neural networks, the SONFIN can always construct itself with an economical network size and a high learning speed. The second part employs a five-layer network to generate all prosodic parameters by directly using the prosodic fuzzy rules inferred by the first part together with other important features of the syllables. A TTS system combined with the proposed method can reproduce not only sandhi rules but also the other prosodic phenomena handled by traditional TTS systems. Moreover, the proposed scheme can even discover some new rules about prosodic phrase structure. The performance of the proposed RFNN-based prosodic model is verified by embedding it into a Chinese TTS system with a Chinese monosyllable database based on the time-domain pitch-synchronous overlap-add (TD-PSOLA) method. Our experimental results show that the proposed RFNN can generate proper prosodic parameters, including pitch means, pitch shapes, maximum energy levels, syllable durations, and pause durations. Some synthetic sounds are available online for demonstration.
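The two-part idea described above (fuzzy rules that fire on linguistic features, followed by a recurrent stage that turns rule activations into a prosodic value) can be illustrated with a minimal sketch. This is not the RFNN or the SONFIN as defined in the paper: the Gaussian membership functions, the product combination of memberships, the single `tanh` output with scalar feedback, and all names and numbers (`firing_strengths`, `recurrent_step`, `rules`, `rule_weights`, `feedback_weight`) are assumptions chosen only to make the data flow concrete.

```python
import math


def firing_strengths(x, rules):
    """Return the normalized firing strength of each fuzzy rule for input x.

    Each rule is a list of (center, width) pairs, one per input feature.
    Gaussian memberships combined by product are an assumption of this sketch.
    """
    raw = []
    for rule in rules:
        strength = 1.0
        for xi, (center, width) in zip(x, rule):
            strength *= math.exp(-((xi - center) / width) ** 2)
        raw.append(strength)
    total = sum(raw)
    if total == 0.0:
        # No rule fires measurably: fall back to uniform strengths.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def recurrent_step(strengths, prev_output, rule_weights, feedback_weight):
    """Combine rule activations with feedback from the previous syllable's output."""
    feedforward = sum(s * w for s, w in zip(strengths, rule_weights))
    return math.tanh(feedforward + feedback_weight * prev_output)


# Hypothetical toy setup: two linguistic features per syllable, two rules.
rules = [
    [(0.0, 1.0), (0.0, 1.0)],  # rule 1: both features near 0
    [(1.0, 1.0), (1.0, 1.0)],  # rule 2: both features near 1
]
rule_weights = [-0.5, 0.5]     # illustrative consequents, e.g. a normalized pitch offset
syllable_features = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]

output = 0.0
outputs = []
for x in syllable_features:
    strengths = firing_strengths(x, rules)
    output = recurrent_step(strengths, output, rule_weights, feedback_weight=0.3)
    outputs.append(output)
```

In this sketch the feedback term is what lets the value generated for one syllable depend on the preceding syllables, which is the role the abstract attributes to the recurrent structure. A real system would learn the rule centers, widths, and weights from data and would emit several prosodic parameters per syllable rather than one scalar.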
