Most automatic speech recognition (ASR) systems use hidden Markov models (HMMs) with a diagonal-covariance Gaussian mixture model for the state-conditional probability density function. The diagonal-covariance Gaussian mixture can model discrete sources of variability such as speaker, gender, or local dialect, but cannot model continuous sources of variability that introduce correlation between the elements of the feature vector. In this paper, we present a transformation of the acoustic feature vector that minimizes an empirical estimate of the relative entropy between the likelihood under the diagonal-covariance Gaussian mixture HMM and the true likelihood. We show that this minimization is equivalent to maximizing the likelihood in the original feature space. Based on this formulation, we provide a computationally efficient solution built on volume-preserving maps, and we show that existing linear feature transform designs are special cases of the proposed solution. Since most acoustic features used in ASR are not linear functions of the sources of correlation in the speech signal, we use a non-linear transformation of the features to minimize this objective function. We describe an iterative algorithm that jointly estimates the parameters of the volume-preserving feature transformation and the HMMs so as to optimize the objective function for an HMM-based speech recognizer. Using this algorithm, we achieve a 2% improvement in phoneme recognition accuracy over a baseline system that uses the original Mel-frequency cepstral coefficient (MFCC) features. We also compare our approach to previous linear approaches such as MLLT and ICA.
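To make the claimed equivalence concrete, the following is a brief sketch of the standard argument; the notation (an invertible transform $F$, diagonal-covariance model density $p_\theta$, and training samples $x_1,\dots,x_N$) is introduced here for illustration and is not taken verbatim from the paper.

\begin{align*}
\hat{p}(x) &= p_\theta\!\big(F(x)\big)\,\lvert \det J_F(x)\rvert
  && \text{(model density induced in the original feature space)}\\
D\big(p \,\Vert\, \hat{p}\big)
  &= -H(p) \;-\; \mathbb{E}_{p}\!\big[\log p_\theta\!\big(F(x)\big)\big]
     \;-\; \mathbb{E}_{p}\!\big[\log \lvert \det J_F(x)\rvert\big].
\end{align*}

If $F$ is volume preserving, then $\lvert \det J_F(x)\rvert = 1$, so the Jacobian term vanishes, and $-H(p)$ does not depend on $(F,\theta)$. Minimizing the empirical estimate of $D(p\,\Vert\,\hat{p})$ is therefore the same as maximizing
\[
  \frac{1}{N}\sum_{n=1}^{N} \log p_\theta\!\big(F(x_n)\big),
\]
i.e.\ maximum-likelihood training of the diagonal-covariance HMM on the transformed features, which coincides with the likelihood in the original feature space because the unit Jacobian contributes nothing.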