Modern methods of speech synthesis

We have examined various aspects of producing synthetic speech. Synthetic speech has numerous applications, most of which start from textual input, i.e., text-to-speech (TTS) synthesis. Given the large amount of text held in databases and the public's need to access information efficiently, synthetic speech is a natural way to deliver that information. A major future application is speech-to-speech translation, in which a person speaking one language converses automatically with someone using another: automatic speech recognition (ASR) transcribes the original speech to text in language A, an automatic text translator maps that text to language B, and a TTS system for the second language generates the output speech.
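
The three-stage cascade described above (ASR, then text translation, then TTS) can be sketched as a simple composition of functions. This is only an illustrative sketch: the class `SpeechToSpeechTranslator` and the toy stages `toy_asr`, `toy_mt`, and `toy_tts` are hypothetical names introduced here, and the toy stages pass text around as bytes instead of processing real audio.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real systems would plug in actual components.
Recognizer = Callable[[bytes], str]   # ASR: speech in language A -> text in A
Translator = Callable[[str], str]     # MT:  text in A -> text in B
Synthesizer = Callable[[str], bytes]  # TTS: text in B -> speech in B


@dataclass
class SpeechToSpeechTranslator:
    """Chains ASR, text translation, and TTS in that order."""
    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def __call__(self, speech_a: bytes) -> bytes:
        text_a = self.recognize(speech_a)  # transcribe language A
        text_b = self.translate(text_a)    # map text to language B
        return self.synthesize(text_b)     # generate speech in language B


# Toy stand-ins (assumptions for illustration only): "audio" is UTF-8 text.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


TOY_LEXICON = {"hello": "bonjour", "world": "monde"}


def toy_mt(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(w, w) for w in text.lower().split())


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechToSpeechTranslator(toy_asr, toy_mt, toy_tts)
result = pipeline(b"Hello world")
```

Keeping each stage behind a plain callable makes the cascade's main weakness easy to see: an error made by the recognizer or translator is passed unchanged to every later stage.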
