Time alignment of natural speech to synthetic speech

A capacity to carry out reliable automatic time alignment of synthetic speech to naturally produced speech offers potential benefits in speech recognition and speaker recognition as well as in synthesis itself. Phrase alignment experiments are described which indicate that alignment to synthetic speech is more difficult than alignment of speech from two natural speakers. An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy. By this measure, alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech, by modifying the spectrum similarity metric, and by generating the synthetic spectra directly from the control parameters using simplified excitation spectra. The improvements appear to level off, however, at an accuracy below that found between natural speakers. It is conjectured that further improvement requires modifications to the synthesis rules themselves.
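
The abstract does not name the alignment algorithm or the spectrum similarity metric. As a rough illustration of what time-aligning two utterances involves, the sketch below assumes a basic dynamic time warping (DTW) over sequences of spectral frames, with a plain Euclidean distance standing in for the similarity metric. The names `euclidean` and `dtw_align` and the toy frame values are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Distance between two spectral frames (an assumed stand-in metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(natural, synthetic, dist=euclidean):
    """Align two frame sequences with basic DTW.

    Returns (total_cost, path), where path is a list of
    (natural_index, synthetic_index) pairs from start to end.
    """
    n, m = len(natural), len(synthetic)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning the first i natural
    # frames with the first j synthetic frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(natural[i - 1], synthetic[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance natural only
                cost[i][j - 1],      # advance synthetic only
            )

    # Trace back from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


# Toy one-dimensional "spectra": the natural utterance holds its
# second sound for one frame longer than the synthetic one.
natural = [[0.0], [1.0], [1.0], [2.0]]
synthetic = [[0.0], [1.0], [2.0]]
total, path = dtw_align(natural, synthetic)
print(total, path)
```

Passing a different `dist` function is one way to experiment with a modified similarity metric; a speaker adaptation transformation could similarly be applied to the synthetic frames before calling `dtw_align`. Both are left as assumptions here, since the abstract gives no details of either.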
