The impact of speech recognition on speech synthesis

In the past few years, speech synthesis has shifted dramatically toward a corpus-based approach, borrowing heavily from advances in automatic speech recognition (ASR). In this paper, we survey the technology used in speech recognition systems and how it translates (or does not translate) to speech synthesis systems. We further speculate on future areas where ASR may impact synthesis and vice versa.
