Comparison and combination of features in a hybrid HMM/MLP and a HMM/GMM speech recognition system

Recently, the advantages of the spectral parameters obtained by frequency filtering (FF) of the logarithmic filter-bank energies (logFBEs) have been reported. These parameters, which are frequency derivatives of the logFBEs, lie in the frequency domain and have shown good recognition performance relative to the conventional mel-frequency cepstral coefficients (MFCCs) in hidden Markov model (HMM) based systems. In this paper, the FF features are first compared with the MFCCs and the relative spectral perceptual linear prediction (Rasta-PLP) features, using both a hybrid HMM/multilayer perceptron (HMM/MLP) recognition system and a conventional HMM/Gaussian mixture model (HMM/GMM) recognition system, for both clean and noisy speech. Because the hybrid system can handle correlated features, the inclusion of both the frequency second derivatives and the raw logFBEs as additional features is proposed and tested. Moreover, the robustness of these features in noisy conditions is enhanced by combining the FF technique with the Rasta temporal filtering approach. Finally, a study of the FF features in the framework of multistream processing is presented. The best recognition results for both clean and noisy speech are obtained from the multistream combination of the J-Rasta-PLP features and the FF features.
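
As an illustration of what a frequency derivative of the logFBEs can look like, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the filter, so the sketch assumes the first-order filter H(z) = z - z^{-1} applied along the band index, with zero padding at both ends of the band sequence. The function name `frequency_filter` and the example values are hypothetical.

```python
def frequency_filter(log_fbes):
    """Filter one frame of log filter-bank energies along the band axis.

    Assumption: the filter is H(z) = z - z^{-1}, so each output is the
    difference between the next band and the previous band, and the
    sequence is zero-padded at both ends.
    """
    # padded[i] holds band i - 1; the two extra zeros stand for the
    # bands just outside the filter bank.
    padded = [0.0] + list(log_fbes) + [0.0]
    # Output for band k is band k + 1 minus band k - 1.
    return [padded[k + 2] - padded[k] for k in range(len(log_fbes))]


# Hypothetical log energies for a four-band filter bank (one frame).
frame = [1.0, 2.0, 4.0, 7.0]
print(frequency_filter(frame))  # [2.0, 3.0, 5.0, -4.0]
```

Under the same assumption, applying the function twice to a frame would give one possible form of the frequency second derivatives mentioned above.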
