On the decorrelation of filter-bank energies in speech recognition

Cepstral coefficients are widely used in speech recognition. In this paper, we claim that they are not the best way of representing the spectral envelope, at least for some usual speech recognition systems. In fact, cepstrum has several disadvantages: poor physical meaning, need of transformation, and low capacity of adaptation to some recognition systems. In this paper, we propose a new representation that significantly outperforms both mel-cepstrum and LPC-cepstrum techniques in both recognition rate and computational cost. It consists of filtering the frequency sequence of filter-bank energies with an extremely simple filter that equalizes the variance of the cepstral coefficients. Excellent results of the new technique using a continuous observation density HMM recognition system and two very different recognition tasks, connected digits and phone recognition, are presented.

[1]  Dennis H. Klatt,et al.  Prediction of perceived phonetic distance from critical-band spectra: A first step , 1982, ICASSP.

[2]  R. G. Leonard,et al.  A database for speaker-independent digit recognition , 1984, ICASSP.

[3]  B. Hanson,et al.  Spectral slope based distortion measures for all-pole models of speech , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Biing-Hwang Juang,et al.  On the use of bandpass liftering in speech recognition , 1987, IEEE Trans. Acoust. Speech Signal Process..

[5]  Yoh'ichi Tohkura,et al.  A weighted cepstral distance measure for speech recognition , 1987, IEEE Trans. Acoust. Speech Signal Process..

[6]  Joseph Picone,et al.  Signal modeling techniques in speech recognition , 1993, Proc. IEEE.