Two techniques for speaker adaptation based on frequency scale modifications are described and evaluated. In the first method, minimum mean square error matching is performed between a spectral template for each speaker and a "typical speaker" spectral template, with a single parameter, a warping factor, controlling the spectral matching. In the second method, a neural network classifier is used to adjust the frequency warping factor for each speaker so as to maximize that speaker's vowel classification performance. A vowel classifier trained only with normalized female speech and tested only with normalized male speech, or vice versa, is nearly as accurate as one trained and tested on unnormalized speech with matched speaker genders. The improvement due to normalization is much smaller if training and test data are matched. The normalization based on classification performance is superior to that based on minimizing mean square error.
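The first method can be illustrated with a minimal sketch. The abstract does not state the form of the warping function, the search procedure, or which template is warped, so everything here is an assumption made for illustration: a linear frequency scaling, a grid search over candidate factors, and warping of the speaker template toward the typical one. The names `warp_spectrum`, `best_warp_factor`, and `typical_shape` are hypothetical, and the templates are synthetic.

```python
import numpy as np

def warp_spectrum(spectrum, alpha):
    """Resample a spectrum on a linearly scaled axis: out[f] = spectrum[alpha * f].

    Linear scaling is an assumption; the source does not specify the warping function.
    """
    bins = np.arange(len(spectrum), dtype=float)
    # np.interp holds the edge values for queries outside [0, len - 1].
    return np.interp(alpha * bins, bins, spectrum)

def best_warp_factor(speaker_template, typical_template, alphas):
    """Return the candidate factor that minimizes the mean square error
    between the warped speaker template and the typical-speaker template."""
    errors = [
        np.mean((warp_spectrum(speaker_template, a) - typical_template) ** 2)
        for a in alphas
    ]
    return float(alphas[int(np.argmin(errors))])

def typical_shape(f):
    """Synthetic single-peak spectral template (illustrative only)."""
    return np.exp(-0.5 * ((f - 40.0) / 4.0) ** 2)

bins = np.arange(128, dtype=float)
typical = typical_shape(bins)

# Synthetic speaker whose spectrum is the typical one stretched by a factor of 1.2.
true_alpha = 1.2
speaker = typical_shape(bins / true_alpha)

alphas = np.linspace(0.8, 1.4, 61)  # candidate warping factors, step 0.01
alpha_hat = best_warp_factor(speaker, typical, alphas)
print(f"estimated warping factor: {alpha_hat:.2f}")
```

Under the same assumptions, the second method would keep the search over candidate factors but replace the mean square error criterion with the vowel classification accuracy of a trained classifier on that speaker's warped features.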