Robust automatic speech recognition in low-SNR car environments by the application of a connectionist subspace-based approach to the mel-based cepstral coefficients

ABSTRACT

In this paper, the problem of robust large-vocabulary continuous-speech recognition (CSR) in the presence of highly interfering car noise is considered. Our approach is based on noise reduction of the parameters that we use for recognition, that is, the Mel-based cepstral coefficients. This is achieved by the use of a Multilayer Perceptron (MLP) network for noise reduction in the cepstral domain in order to get less-variant parameters. Then, the obtained enhanced features are refined via the Karhunen-Loève Transform (KLT) implemented using Principal Component Analysis (PCA). Experiments show that the use of the parameters enhanced by such an approach increases the recognition rate of the CSR process in highly interfering car noise environments. The HTK Hidden Markov Model Toolkit was used throughout our experiments. Results show that the proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms conventional recognition based on either KLT- or MLP-based preprocessing alone in severe interfering car noise environments for a wide range of SNRs varying from 16 dB to -4 dB, using a noisy version of the TIMIT database.

1. INTRODUCTION

The performance of existing CSR systems, whose designs are predicated on relatively noise-free conditions, degrades rapidly in the presence of a high level of adverse conditions. Several approaches have been studied for achieving noise robustness [1, 2]. In this paper, we focus on optimizing the performance of a CSR system by choosing a suitable distortion measure. The idea of a robust distance measure is to extract relevant features from speech signals which must be insensitive to degradations of the speech signal due to interfering noise or distortions. Many approaches [3] have been used to extract relevant features from a speech signal. Cepstral parameters are well suited to speech recognition due to their compact orthogonality.
Unfortunately, cepstral features are highly sensitive to noise. It was shown in [4] that cepstral distributions for clean data are well behaved and approximately normal, but in the presence of noise their profiles change significantly, and this consequently degrades the performance of a CSR system. However, the cepstrum coefficients have the additional advantage that one can derive from them a set of parameters which are invariant to any fixed frequency-response distortion introduced by either the adverse environments or the transmission channels. Several approaches to obtain a new set of robust parameters were introduced in [5, 6, 7].

In this paper, we propose a novel robust CSR system to be used in noisy car environments. Our approach for noise reduction is applied in the cepstral domain. It is based on the application of a combination of the Karhunen-Loève Transform (KLT) and a connectionist approach. Each of these two approaches has been successfully used in both speech enhancement and recognition processes. We show in this paper, through experiments on highly noisy data, that cepstral noise reduction can be obtained using such an approach, and consequently an improvement of the recognition performance.

This paper is organized as follows. In section 2 we describe the basis of the MLP network and the PCA approaches that will be used to describe our proposed hybrid PCA-MLP approach. Then, we proceed in section 3 with the description of the database, the platform used in our experiments, the evaluation of the proposed MLP-PCA-based recognizer in a noisy car environment, and the comparison of such a recognizer to both the MLP- and the PCA-based recognizers in order to evaluate its performance. Finally, in section 5 we conclude and discuss our results.

2. PROPOSED ENHANCEMENT APPROACH

2.1. Multilayer Perceptron Network

As mentioned above, the first step that has been proposed to improve the performance of the CSR process in highly noisy car environments in the cepstral domain is the use of a multilayer perceptron (MLP) network. The fact that the noise and the speech signal are combined in a nonlinear way in the cepstral domain motivated us to choose the MLP, since it can approximate the required nonlinear function to some extent [6, 7]. The input of the MLP is the noisy MFCC vector
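
As a rough illustration of the two-stage enhancement outlined in the abstract (an MLP that maps noisy cepstral vectors toward clean ones, followed by a KLT computed through PCA), the following NumPy sketch may help. It is not the configuration used in the paper: the 13-dimensional vectors, the 16-unit tanh hidden layer, plain gradient descent, the synthetic data (a toy log-domain mixture standing in for real MFCCs), and all function names are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_mlp(noisy, clean, hidden=16, lr=0.05, epochs=300):
    """Fit a one-hidden-layer MLP (tanh hidden, linear output) mapping noisy -> clean.

    Hypothetical helper: layer size, learning rate and epochs are illustrative.
    """
    d = noisy.shape[1]
    w1 = rng.normal(0.0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, (hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(noisy @ w1 + b1)          # hidden activations
        err = (h @ w2 + b2) - clean           # output error
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        g_out = 2.0 * err / err.size
        g_h = (g_out @ w2.T) * (1.0 - h ** 2)
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (noisy.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return (w1, b1, w2, b2), losses

def mlp_enhance(params, x):
    """Apply the trained MLP to (noisy) feature vectors."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2

def klt_basis(features):
    """Return the mean and the eigenvectors of the covariance, largest variance first."""
    mean = features.mean(axis=0)
    cov = np.cov(features - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order]

def klt_refine(features, mean, basis, k):
    """Project onto the k leading principal axes and reconstruct."""
    top = basis[:, :k]
    return (features - mean) @ top @ top.T + mean

# Synthetic stand-in data (assumption): speech and noise combine nonlinearly.
clean = rng.normal(size=(400, 13))
noise = 0.5 * rng.normal(size=(400, 13))
noisy = np.logaddexp(clean, noise)

params, losses = train_mlp(noisy, clean)
enhanced = mlp_enhance(params, noisy)         # stage 1: MLP noise reduction
mean, basis = klt_basis(enhanced)
refined = klt_refine(enhanced, mean, basis, k=8)  # stage 2: KLT/PCA refinement
```

Keeping only the leading principal axes discards the low-variance directions of the enhanced features; the choice of `k=8` here is arbitrary and would have to be tuned on real data.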

[1] H.B.D. Sorensen, et al. A cepstral noise reduction multi-layer neural network, 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[2] Climent Nadeu, et al. A comparative study of parameters and distances for noisy speech recognition, 1991, EUROSPEECH.

[3] Yifan Gong, et al. Speech recognition in noisy environments: A survey, 1995, Speech Commun.

[4] Richard M. Stern, et al. Environment normalization for robust speech recognition using direct cepstral comparison, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[5] Franco Scarselli, et al. Are Multilayer Perceptrons Adequate for Pattern Recognition and Verification?, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[6] Jean-Marc Vesin, et al. Single channel speech enhancement using principal component analysis and MDL subspace selection, 1999, EUROSPEECH.

[7] Jean-Claude Junqua, et al. Robustness in Automatic Speech Recognition, 1996.

[8] John S. D. Mason, et al. On the limitations of cepstral features in noise, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[9] Erkki Oja, et al. Neural Networks, Principal Components, and Subspaces, 1989, Int. J. Neural Syst.

[10] Jukka Saarinen, et al. MLP network for enhancement of noisy MFCC vectors, 1999, EUROSPEECH.