Vocal tract normalization in speech recognition: Compensating for systematic speaker variability

The performance of speech recognition systems is often improved by accounting explicitly for sources of variability in the data. In the SWITCHBOARD corpus, studied during the 1994 CAIP workshop [Frontiers in Speech Processing Workshop II, CAIP (August 1994)], an attempt was made to compensate for the systematic variability due to different vocal tract lengths of various speakers. The method found a maximum probability parameter for each speaker which mapped an acoustic model to the mean of the models taken from a homogeneous speaker population. The underlying acoustic model was that of a straight tube, and the parameter estimation was accomplished by warping the spectrum of each speaker linearly over a 20% range (actually accomplished by digitally resampling the data), and finding the maximum a posteriori probability of the data given the warp. The technique produces statistically significant improvements in accuracy on a speech transcription task using each of four different speech recognition systems. T...