Feature extraction for unit selection in concatenative speech synthesis: comparison between AIM, LPC, and MFCC
A comprehensive computational model of the human auditory periphery, the Auditory Image Model (AIM), was applied to extract basic features of speech sounds with the aim of optimal unit selection in concatenative speech synthesis. The performance of AIM was compared with that of a purely physical model (LPC) and that of an approximate auditory model (MFCC) in basic perceptual experiments. While a significant advantage of AIM over LPC was observed, performance based on AIM selection and MFCC selection did not differ significantly. However, a phoneme space based on the AIM features did not completely match one based on the MFCC features, indicating that the selection was not yet optimal. A detailed investigation of cases of poor concatenation indicates that acoustic discontinuity at comparatively steady phonemic boundaries, especially those between vowel-like sounds, degrades the perceptual impression. Sensitivity to such discontinuity will be required to further improve acoustic measures for unit selection.
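The selection scheme the abstract evaluates can be illustrated with a generic unit-selection cost: each candidate unit is scored by a target cost (how well its features match the desired sound) plus a join cost (the feature-space discontinuity at the concatenation boundary, the quantity the experiments found perceptually critical). The sketch below is a minimal, hypothetical illustration, not the paper's actual procedure; the function names, the 2-D "feature vectors" standing in for AIM/LPC/MFCC frames, and the plain Euclidean distance are all assumptions for demonstration.

```python
import math

def join_cost(prev_unit_end, cand_start):
    # Euclidean distance between feature vectors at the concatenation
    # point; a large value signals the boundary discontinuity that
    # listeners penalise (assumed metric, not the paper's).
    return math.dist(prev_unit_end, cand_start)

def target_cost(target, cand_mean):
    # Distance between the desired features and the candidate's
    # average features over the unit.
    return math.dist(target, cand_mean)

def select_unit(target, prev_end, candidates, w_target=1.0, w_join=1.0):
    """Pick the candidate minimising weighted target + join cost.

    Each candidate is a tuple (name, start_vec, mean_vec), where
    start_vec is the feature frame at the unit's left boundary and
    mean_vec summarises the unit's interior.
    """
    best = min(
        candidates,
        key=lambda c: w_target * target_cost(target, c[2])
                      + w_join * join_cost(prev_end, c[1]),
    )
    return best[0]

# Toy example: candidate "b" matches the target perfectly but joins
# badly; candidate "a" matches slightly worse but joins smoothly.
cands = [
    ("a", [0.1, 0.0], [1.2, 0.1]),
    ("b", [2.0, 2.0], [1.0, 0.0]),
]
chosen = select_unit([1.0, 0.0], [0.0, 0.0], cands)
```

With both costs weighted equally, the smooth-joining candidate "a" wins; zeroing the join weight flips the choice to "b", mirroring the abstract's point that a good frame-wise feature match alone does not guarantee a perceptually acceptable concatenation.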