High-quality text-to-speech synthesis: an overview

This paper aims to give a comprehensive introduction to state-of-the-art Text-To-Speech (TTS) synthesis by highlighting its Digital Signal Processing (DSP) and Natural Language Processing (NLP) components. Since very few people combine a good knowledge of DSP with a comprehensive insight into NLP, synthesis mostly remains unclear, even for people working in either research area. After a brief definition of a general TTS system and of its commercial applications in Section 1, the paper is divided into two parts. Section 2.1 presents the many practical NLP problems that a TTS system has to solve. Section 2.2 then examines how synthetic speech can be obtained by simply concatenating elementary speech units, and what choices have to be made for this operation to yield high quality. We finally say a word on existing TTS solutions, with special emphasis on the computational and economic constraints that have to be kept in mind when designing TTS systems.
