Paper information - Power spectral density based channel equalization of large speech database for concatenative TTS system

This paper proposes a channel equalization algorithm for large speech databases used in concatenative TTS systems. The convolutional channel distortion is equalized by comparing the power spectral densities (PSDs) of utterances from different recording sessions. Autoregressive linear filters are designed at the corpus level and applied offline to the corresponding sentences to compensate for the relative distortions caused by channel effects. Two experiments evaluate the benefit of the approach. First, the method is used to reduce the distance between the PSDs of two recording sessions, which verifies its effectiveness. Second, it is applied in a practical TTS system: the whole TTS speech database is processed to reduce the PSD variance across all sessions. A subjective listening test is then carried out to obtain a human evaluation of the new TTS system, and almost all listeners prefer the synthetic speech it generates. Furthermore, an analysis of variance (ANOVA) on the listening test results demonstrates that channel equalization has a significant effect on increasing the perceived voice-quality consistency of the TTS system.
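
The pipeline described in the abstract (estimate a PSD per session, compare it with a reference, fit an autoregressive filter to the difference, filter offline) can be illustrated with a minimal Python sketch. The abstract does not give the PSD estimator, the filter order, or the fitting procedure, so everything below is an assumption made for illustration: Welch's method for PSD estimation, a Yule-Walker fit of an all-pole filter to the PSD ratio, a 16 kHz sampling rate, and the hypothetical helper names `session_psd`, `fit_ar_equalizer`, `equalize`, and `log_spectral_distance`. It is not the authors' implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter, welch

EPS = 1e-12  # small floor so PSD ratios stay finite (assumed value)


def session_psd(utterances, fs=16000, nperseg=512):
    """Average the Welch PSD estimates of all utterances in one session."""
    psds = [welch(x, fs=fs, nperseg=nperseg, detrend=False)[1] for x in utterances]
    return np.mean(psds, axis=0)


def fit_ar_equalizer(psd_session, psd_reference, order=16):
    """Fit gain / A(z) whose power response approximates reference / session."""
    target = (psd_reference + EPS) / (psd_session + EPS)
    # The inverse FFT of a power response is its autocorrelation sequence.
    r = np.fft.irfft(target)[: order + 1]
    # Yule-Walker equations: Toeplitz(r[0..p-1]) a = -r[1..p]
    a_tail = solve_toeplitz(r[:order], -r[1:])
    gain = np.sqrt(r[0] + np.dot(a_tail, r[1:]))
    return gain, np.concatenate(([1.0], a_tail))


def equalize(utterances, gain, a):
    """Apply the all-pole equalizer offline to every utterance of a session."""
    return [lfilter([gain], a, x) for x in utterances]


def log_spectral_distance(psd_a, psd_b):
    """RMS difference between two PSDs in dB."""
    diff_db = 10.0 * np.log10((psd_a + EPS) / (psd_b + EPS))
    return float(np.sqrt(np.mean(diff_db ** 2)))


if __name__ == "__main__":
    # Synthetic stand-in for two recording sessions: white noise as the
    # reference, and white noise coloured by a made-up channel 1 + 0.5 z^-1.
    rng = np.random.default_rng(0)
    reference = [rng.standard_normal(16000) for _ in range(5)]
    session = [lfilter([1.0, 0.5], [1.0], rng.standard_normal(16000)) for _ in range(5)]

    psd_ref = session_psd(reference)
    psd_ses = session_psd(session)
    gain, a = fit_ar_equalizer(psd_ses, psd_ref, order=4)
    psd_eq = session_psd(equalize(session, gain, a))

    print("distance before:", log_spectral_distance(psd_ses, psd_ref))
    print("distance after: ", log_spectral_distance(psd_eq, psd_ref))
```

In this sketch the equalizer is fitted once per session from averaged PSDs, which mirrors the corpus-level design mentioned in the abstract; a per-utterance fit would track the speech content rather than the channel. The distance measure is likewise only one plausible way to quantify how close two session PSDs are.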

Yu Shi | Hu Peng | Min Chu | Eric Chang
