In this paper, we consider the generation of features for automatic speech recognition (ASR) that are robust to speaker variations. Inter-speaker variation is one of the major causes of degradation in ASR performance. These variations are commonly modeled as a pure scaling relation between the spectra of speakers enunciating the same sound. Current state-of-the-art ASR systems therefore address speaker variability with a brute-force search for the optimal scaling parameter; this procedure, known as vocal-tract length normalization (VTLN), is computationally intensive. We have recently used the Scale-Transform (a variation of the Mellin transform) to generate features that are robust to speaker variations without the need to search for the scaling parameter. However, these features perform worse because phase information is lost. In this paper, we propose to use the magnitude of the Scale-Transform together with a pre-computed "phase" vector for each phoneme to generate speaker-invariant features. We compare the performance of the proposed features with conventional VTLN on a phoneme recognition task.
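The scale-invariance property that motivates these features can be sketched numerically. If two spectra are related by the scaling model, Y(f) = sqrt(a) X(af), then warping to a log-frequency axis turns the scaling into a shift, and the Fourier-transform magnitude along that axis (which approximates the Scale-Transform magnitude) is shift-invariant. The sketch below is illustrative only; the toy Gaussian spectrum, grid limits, and function names are assumptions, not taken from the paper:

```python
import numpy as np

def scale_transform_mag(spectrum_fn, a=1.0, n=1024, u_min=-4.0, u_max=4.0):
    """Approximate |scale transform| of the scaled spectrum sqrt(a)*X(a f).

    Works by sampling on a log-frequency grid u = ln f, applying the
    Mellin kernel weight e^{u/2}, and taking the FFT magnitude along u.
    """
    u = np.linspace(u_min, u_max, n)
    f = np.exp(u)
    # Scaled spectrum Y(f) = sqrt(a) * X(a f), weighted by e^{u/2}.
    warped = np.sqrt(a) * spectrum_fn(a * f) * np.exp(u / 2)
    # FFT magnitude along u is invariant to the shift induced by scaling.
    return np.abs(np.fft.fft(warped))

# Toy "spectrum": a smooth bump in log-frequency (purely illustrative).
X = lambda f: np.exp(-(np.log(f) - 0.5) ** 2)

m1 = scale_transform_mag(X, a=1.0)
m2 = scale_transform_mag(X, a=1.2)  # 20% spectral scaling, e.g. a shorter vocal tract
```

Up to discretization error, `m1` and `m2` coincide, which is why no search over the scaling parameter `a` is needed; by contrast, VTLN would evaluate a grid of candidate warp factors and pick the best one.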