Corpus-based unit selection for natural-sounding speech synthesis

Speech synthesis is an automatic encoding process, carried out by machine, through which symbols conveying linguistic information are converted into an acoustic waveform. Over the past decade or so, a trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel, natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel: an input sequence of symbols passes through it, and an output sequence emerges, possibly corrupted because of the coverage limits of the corpus. The penalty of approximation is quantified by substitution and concatenation costs, which grade which unit contexts are interchangeable and where concatenations are not perceptible. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge. The implementation is based on a finite-state transducer (FST) representation that has been used successfully in speech and language processing applications, including speech recognition. A proposed constraint kernel topology connects all units in the corpus with their associated substitution and concatenation costs and enables an efficient Viterbi search that operates with low latency and scales to large corpora. An A* search can be applied in a second, rescoring pass to incorporate finer acoustic modelling. Extensions to this FST-based search include hierarchical and paralinguistic modelling. The search can also be used in an iterative feedback loop to identify new utterances to record in order to enhance corpus coverage. This speech synthesis framework has been deployed across various domains and languages in many voices, a testament to its flexibility and rapid prototyping capability.
Experimental subjects who completed tasks in a given air travel planning scenario, interacting in real time with a spoken dialogue system over the telephone, found the system “easiest to understand” out of eight competing systems. In more detailed listening evaluations, subjective opinions gathered from human participants are found to correlate with objective measures that can be computed by machine. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
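The search described in the abstract can be illustrated with a small sketch. The Python below is not the FST-based constraint kernel of this work; it is a plain dynamic-programming Viterbi search over a toy corpus, included only to show how substitution and concatenation costs combine when choosing units. The function name `viterbi_unit_selection`, the toy corpus, and both cost functions (including the assumption that "m" and "n" are interchangeable at a penalty of 0.5, and that contiguous corpus units concatenate at zero cost) are hypothetical and chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def viterbi_unit_selection(
    targets: Sequence[str],
    corpus: Sequence[str],
    sub_cost: Callable[[str, str], float],
    cat_cost: Callable[[int, int], float],
) -> Tuple[List[int], float]:
    """Return the lowest-cost sequence of corpus indices for the target symbols.

    Every corpus unit is a candidate for every target position; this naive
    form costs O(len(targets) * len(corpus)**2) and is only a teaching sketch.
    """
    n = len(corpus)
    # best[j]: lowest accumulated cost of any path ending in corpus unit j.
    best = [sub_cost(targets[0], corpus[j]) for j in range(n)]
    back: List[List[int]] = []  # back[t][j]: best predecessor of unit j
    for target in targets[1:]:
        new_best, pointers = [], []
        for j in range(n):
            # Pick the predecessor minimizing accumulated + concatenation cost.
            i_min = min(range(n), key=lambda i: best[i] + cat_cost(i, j))
            new_best.append(
                best[i_min] + cat_cost(i_min, j) + sub_cost(target, corpus[j])
            )
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final unit.
    j = min(range(n), key=lambda k: best[k])
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total


# Hypothetical toy corpus: the phone strings of "cat" followed by "map".
CORPUS = ["k", "ae", "t", "m", "ae", "p"]


def toy_sub_cost(target: str, unit: str) -> float:
    """Assumed substitution cost: exact match is free, m/n swap costs 0.5."""
    if target == unit:
        return 0.0
    if {target, unit} == {"m", "n"}:
        return 0.5
    return math.inf


def toy_cat_cost(i: int, j: int) -> float:
    """Assumed concatenation cost: free if the units are adjacent in the corpus."""
    return 0.0 if j == i + 1 else 1.0


if __name__ == "__main__":
    path, total = viterbi_unit_selection(
        ["n", "ae", "p"], CORPUS, toy_sub_cost, toy_cat_cost
    )
    print(path, total)
```

In this toy setting the target "n ae p" has no exact match for "n", so the search substitutes the "m" of "map" and then follows the contiguous units, paying only the substitution penalty. A real system would restrict the candidates per target and derive both costs from data, as the abstract describes.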
