Affine invariant features and their application to speech recognition

This paper proposes a set of affine invariant features (AIFs) for sequence data. The proposed AIFs can be calculated directly from the sequence data, and their invariance to affine transformation is proved mathematically through algebraic calculation. We apply the AIFs to speech recognition. Since a vocal tract length (VTL) difference causes frequency warping, which can be well approximated by an affine transform of cepstral features [1], the AIFs of a cepstral sequence provide features that are robust to VTL variations. We experimentally examine the invariance of the AIFs of speech signals and apply them to Japanese isolated-word recognition. The experimental results show that combining AIFs with MFCC or MFCC+Δ leads to higher recognition rates than MFCC or MFCC+Δ alone. In particular, under mismatched conditions, the combination with AIFs reduces the error rates by about 30% compared to MFCC or MFCC+Δ alone. Since their invariance is general, the AIFs are expected to have applications beyond speech recognition.
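
The abstract does not spell out the AIF formulas, so the sketch below only illustrates the general idea of an affine-invariant description of a feature sequence, not the authors' construction. It uses pairwise Mahalanobis distances computed with the sequence's own covariance, which are exactly unchanged under any invertible affine map y_t = A x_t + b (the same kind of map that models VTL-induced cepstral warping). The function name `affine_invariant_distances` and the random stand-in data are hypothetical.

```python
import numpy as np

def affine_invariant_distances(X):
    """Pairwise squared Mahalanobis distances of a feature sequence X (T x D),
    computed with the sequence's own covariance.  If every frame undergoes the
    same invertible affine map y_t = A x_t + b, the covariance becomes
    A Sigma A^T and the A's cancel, so these distances are affine invariant."""
    Xc = X - X.mean(axis=0)                  # centering cancels the offset b
    Sigma = np.cov(Xc, rowvar=False)         # D x D sequence covariance
    Sinv = np.linalg.inv(Sigma)
    diffs = Xc[:, None, :] - Xc[None, :, :]  # T x T x D frame differences
    return np.einsum('ijd,de,ije->ij', diffs, Sinv, diffs)

# Numerical check of the invariance with a random affine transform.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 12))            # stand-in for a 12-dim cepstral sequence
A = rng.standard_normal((12, 12))            # random invertible linear part
b = rng.standard_normal(12)
Y = X @ A.T + b                              # affine-transformed sequence
print(np.allclose(affine_invariant_distances(X),
                  affine_invariant_distances(Y)))   # True
```

Any quantity built only from such distances inherits the invariance, which is the property the paper exploits to obtain features robust to VTL variation.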

[1] Alexander Kadyrov et al., Affine invariant features from the trace transform, 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Mark J. F. Gales et al., Mean and variance adaptation within the MLLR framework, 1996, Computer Speech & Language.

[3] Li Lee et al., A frequency warping approach to speaker normalization, 1998, IEEE Transactions on Speech and Audio Processing.

[4] Jan Flusser et al., Pattern recognition by affine moment invariants, 1993, Pattern Recognition.

[5] Nobuaki Minematsu et al., Random discriminant structure analysis for automatic recognition of connected vowels, 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[6] Keikichi Hirose et al., Multi-stream parameterization for structural speech recognition, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7] Herbert Gish et al., A parametric approach to vocal tract length normalization, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[8] Leon Cohen et al., Scale transform in speech analysis, 1999, IEEE Transactions on Speech and Audio Processing.

[9] Misha Pavel et al., On the relative importance of various components of the modulation spectrum for automatic speech recognition, 1999, Speech Communication.

[10] Nobuaki Minematsu, Mathematical evidence of the acoustic universal structure in speech, 2005, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05).

[11] Xavier Pennec et al., A Riemannian Framework for Tensor Computing, 2005, International Journal of Computer Vision.

[12] A. Mertins et al., Vocal tract length invariant features for automatic speech recognition, 2005, IEEE Workshop on Automatic Speech Recognition and Understanding.

[13] W. Förstner et al., A Metric for Covariance Matrices, 2003.

[14] P. Forrester, Eigenvalue distributions for some correlated complex sample covariance matrices, 2006, math-ph/0602001.

[15] Roy D. Patterson et al., Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform, 2002, Speech Communication.

[16] Hermann Ney et al., Vocal tract normalization equals linear transformation in cepstral space, 2001, IEEE Transactions on Speech and Audio Processing.

[17] Chin-Hui Lee et al., Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Transactions on Speech and Audio Processing.