Speech Generation in a Spoken Dialogue System

Spoken dialogue systems accessed over the telephone network are rapidly becoming more popular as a means of reducing call-centre costs and improving customer experience. It is now technologically feasible to delegate the repetitive and relatively simple tasks carried out in most telephone calls to automatic systems. Such a system uses speech recognition to take input from users. This work focuses on the speech generation component that a specific prototype system uses to convey audible speech output back to the user.

Many commercial systems contain general-purpose text-to-speech synthesisers. Text-to-speech synthesis, a very active branch of speech processing, aims to build machines that read text aloud, and in some languages this has been a reality for almost two decades. While these synthesisers are often highly intelligible, they almost never sound natural. The output quality of synthetic speech is considered a very important factor in the user’s perception of the quality and usability of a spoken dialogue system.

This work exploits the static nature of the spoken dialogue system to produce a custom speech synthesis component that provides very high quality output speech for the particular application. To this end, the current state of the art in speech synthesis is surveyed and summarised, and a unit-selection synthesiser is produced that functions in Afrikaans, English and Xhosa. The synthesiser selects short waveforms from a recorded speech corpus and concatenates them to produce the required utterances. Techniques are developed for designing a compact corpus and processing it to produce a unit-selection database. Speech modification methods are investigated to build a framework for natural-sounding speech concatenation. This framework also provides pitch and duration modification capabilities that will enable research in languages such as Afrikaans and Xhosa, where text-to-speech capabilities are relatively immature.
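As an illustration of the unit-selection idea only, the following minimal Python sketch chooses one recorded unit per required label by dynamic programming over a target cost and a join cost. It is not the synthesiser described in this work: the `Unit` fields, the pitch-only cost functions, the fixed join penalty of 1.0 and the name `select_units` are all assumptions made for the example, and real systems use richer features and operate on waveforms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    index: int    # position of the unit in the recorded corpus (hypothetical field)
    label: str    # unit identity, e.g. a phone label (hypothetical field)
    pitch: float  # mean pitch in Hz, used here as the only feature (assumption)


def target_cost(unit, wanted_pitch):
    # Assumed cost: how far the unit's pitch is from the requested pitch.
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    # Units that were adjacent in the recording join for free (assumption);
    # otherwise pay the pitch mismatch plus a fixed illustrative penalty.
    if right.index == left.index + 1:
        return 0.0
    return abs(left.pitch - right.pitch) + 1.0


def select_units(corpus, targets):
    """Return (units, total_cost) for targets given as (label, wanted_pitch)."""
    if not targets:
        return [], 0.0
    candidates = [[u for u in corpus if u.label == label] for label, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("corpus does not cover every required unit")

    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i][1]), back))
        best.append(row)

    # Trace the cheapest path back from the final position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


if __name__ == "__main__":
    corpus = [
        Unit(0, "h", 120.0), Unit(1, "e", 118.0), Unit(2, "l", 110.0),
        Unit(3, "o", 100.0), Unit(4, "h", 100.0), Unit(5, "o", 119.0),
    ]
    units, cost = select_units(corpus, [("h", 100.0), ("o", 100.0)])
    print([u.index for u in units], cost)
```

In a concatenative system the selected units would then be joined at the waveform level, which is where the speech modification framework mentioned above would apply; that step is deliberately left out of this sketch.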
