Feature extraction for unit selection in concatenative speech synthesis: comparison between AIM, LPC, and MFCC

A comprehensive computational model of the human auditory periphery, the Auditory Image Model (AIM), was applied to extract basic features of speech sounds for optimal unit selection in concatenative speech synthesis. The performance of AIM was compared with that of a purely physical model, linear predictive coding (LPC), and that of an approximate auditory model, mel-frequency cepstral coefficients (MFCC), in basic perceptual experiments. AIM showed a significant advantage over LPC, whereas AIM-based and MFCC-based selection did not differ significantly in performance. However, the phoneme space derived from the AIM features did not completely match that derived from the MFCC features, indicating that selection is not yet optimal. A detailed investigation of poorly concatenated cases indicates that acoustic discontinuity at comparatively steady phonemic boundaries, especially those between vowel-like sounds, degrades the perceptual impression. Acoustic measures for unit selection must therefore become sensitive to such discontinuities in order to improve further.
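The abstract does not describe the implementation, but the boundary discontinuity it identifies can be sketched as a simple MFCC-based join cost. The following minimal Python example, which assumes the librosa library and illustrative parameter values rather than anything stated in the paper, scores the spectral mismatch between the final frame of one candidate unit and the initial frame of the next, and selects the candidate that minimizes it.

# A minimal sketch (not the authors' implementation) of an MFCC-based
# join cost for unit selection.  librosa, n_mfcc=13, and the frame
# choice are illustrative assumptions.
import numpy as np
import librosa

def mfcc_frames(signal: np.ndarray, sr: int, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_mfcc, n_frames) MFCC matrix for one speech unit."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

def join_cost(left_unit: np.ndarray, right_unit: np.ndarray, sr: int) -> float:
    """Acoustic discontinuity at the concatenation boundary: Euclidean
    distance between the last MFCC frame of the left unit and the
    first MFCC frame of the right unit."""
    left = mfcc_frames(left_unit, sr)[:, -1]   # final frame of left unit
    right = mfcc_frames(right_unit, sr)[:, 0]  # initial frame of right unit
    return float(np.linalg.norm(left - right))

def select_unit(left_unit: np.ndarray, candidates: list, sr: int) -> np.ndarray:
    """Pick the candidate unit whose boundary frame is acoustically
    closest to the already-chosen left unit."""
    return min(candidates, key=lambda c: join_cost(left_unit, c, sr))

A frame-level distance of this kind is deliberately local; the abstract's observation that discontinuities at steady boundaries between vowel-like sounds remain perceptible suggests that a practical cost would also need to weight such steady regions more heavily.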