Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model

Text-to-speech conversion has traditionally been performed either by concatenating short samples of recorded speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. This paper describes a text-to-speech synthesis system for Modern Standard Arabic based on artificial neural networks and a residual-excited LPC coder. The networks offer a storage-efficient means of synthesis without the need for explicit rule enumeration, but they require large, prosodically labeled continuous-speech databases for training. Because no such database was available for the Arabic language, we developed one for this purpose, and we discuss the various stages of that development process. In addition to the interpolation capabilities of the neural networks, a linear interpolation of the coder parameters is performed to create smooth transitions at segment boundaries. The paper also describes the residual-excited all-pole vocal tract model and a neural-network-based prosodic-information synthesizer.
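The two signal-processing ideas in the abstract can be sketched briefly: linearly interpolating coder parameters frame by frame across a segment boundary, and driving an all-pole (LPC) synthesis filter with a residual excitation signal. This is a minimal illustrative sketch, not the paper's implementation; the function names and the choice of interpolating raw predictor coefficients (rather than, say, line spectral frequencies) are assumptions made for clarity.

```python
def interpolate_params(params_a, params_b, n_frames):
    """Linearly interpolate a coder parameter vector from segment A to
    segment B over n_frames, yielding smooth transitions at the boundary."""
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0.0 at segment A, 1.0 at segment B
        frames.append([(1.0 - t) * a + t * b
                       for a, b in zip(params_a, params_b)])
    return frames

def synthesize_frame(residual, lpc, history):
    """Residual-excited all-pole synthesis for one frame:
        s[n] = e[n] - sum_k a_k * s[n - k]
    where e[n] is the LPC residual and a_k the predictor coefficients.
    `history` holds past output samples, most recent first."""
    out = []
    hist = list(history)
    for e in residual:
        s = e - sum(a * h for a, h in zip(lpc, hist))
        out.append(s)
        hist = [s] + hist[:-1]  # shift the output history
    return out, hist
```

Carrying `history` from one frame to the next keeps the filter state continuous, so the interpolated coefficients can be updated per frame without introducing clicks at frame edges.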
