Paper information - Modern methods of speech synthesis

Modern methods of speech synthesis

We have examined various aspects of how to produce synthetic speech. There are numerous applications for such synthetic speech, mostly when starting from a textual input, i.e., TTS. Given the large amount of text in databases and the public's need to access information efficiently, synthetic speech is a natural way to obtain information. A major application of the future will be speech-to-speech translation, in which a person speaking in one language will be able to converse automatically with someone using another language: ASR would transcribe the original speech to a textual form in language A, then an automatic text translator would map that text to language B, and finally a TTS system for this second language would generate the output speech.
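The three-stage cascade described above (ASR, then text translation, then TTS) can be sketched as a simple composition of functions. This is only an illustrative sketch: the names `make_speech_translator`, `toy_asr`, `toy_translate`, `toy_tts`, and `TOY_LEXICON` are hypothetical and do not come from the paper, and the toy stages merely stand in for real recognition, translation, and synthesis components.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
Recognizer = Callable[[bytes], str]    # speech in language A -> text in language A
Translator = Callable[[str], str]      # text in language A -> text in language B
Synthesizer = Callable[[str], bytes]   # text in language B -> speech in language B


def make_speech_translator(
    asr: Recognizer, translate: Translator, tts: Synthesizer
) -> Callable[[bytes], bytes]:
    """Chain the three stages into one speech-to-speech function."""

    def pipeline(speech_a: bytes) -> bytes:
        text_a = asr(speech_a)        # 1. transcribe the original speech
        text_b = translate(text_a)    # 2. map the text to the target language
        return tts(text_b)            # 3. generate the output speech

    return pipeline


# Toy stand-ins: the "audio" is just UTF-8 text, so the flow can be run end to end.
TOY_LEXICON = {"hello": "bonjour", "world": "monde"}


def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


speech_to_speech = make_speech_translator(toy_asr, toy_translate, toy_tts)
```

Keeping each stage behind a plain function signature means any one component could be swapped without touching the other two, which is the main appeal of the cascade design.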

D. O'Shaughnessy
