Speech recognition in non-stationary adverse environments

We introduce non-stationary adaptation (NA), a new approach to recognizing speech in non-stationary adverse environments. NA uses two models: a speaker-independent hidden Markov model (HMM) for clean speech, and an ergodic Markov chain representing the non-stationary adverse environment. Each state of the Markov chain represents one stationary adverse condition and has an associated affine transform, which is estimated by maximum likelihood linear regression (MLLR). We consider three kinds of adverse environment: (i) multi-speaker speech recognition, where the speaker identity changes randomly and thus constitutes a non-stationary adverse condition; (ii) recognition of speech corrupted by machine-gun noise; and (iii) the crosstalk problem. The algorithm is tested on the Nov92 development database of WSJ0 with a vocabulary size of 20,000. In multi-speaker speech recognition, NA decreases the error rate by 13.6%. For speech corrupted by machine-gun noise, a one-state Markov chain decreases the error rate by 18%, and a two-state Markov chain gives a further 14% decrease. In the crosstalk problem, a one-state Markov chain decreases the error rate by 16.8%, while two-state and three-state Markov chains decrease it by 22% and 24.4%, respectively.
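To make the two-model structure concrete, the following is a minimal sketch and not the authors' implementation. It assumes a drastically simplified setting: one-dimensional features, a single clean-speech Gaussian in place of a full HMM, and affine transforms that are already known rather than estimated by MLLR. All names (`adapt_mean`, `viterbi_environment`) and all numeric values are hypothetical. The sketch shows how each environment state maps the clean mean through its own affine transform, and how a Viterbi pass over the ergodic chain could label each frame with its most likely environment state.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def adapt_mean(mean, transform):
    """Apply one environment state's affine transform (a, b) to a clean mean."""
    a, b = transform
    return a * mean + b


def viterbi_environment(frames, clean_mean, var, transforms, log_init, log_trans):
    """Most likely environment-state sequence under an ergodic Markov chain."""
    n_states = len(transforms)
    # One adapted mean per environment state (assumed transforms, not MLLR estimates).
    means = [adapt_mean(clean_mean, t) for t in transforms]
    score = [log_init[k] + log_gauss(frames[0], means[k], var)
             for k in range(n_states)]
    back = []
    for x in frames[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Ergodic chain: every state j is a candidate predecessor of k.
            best_j = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][k]
                             + log_gauss(x, means[k], var))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy data: state 0 is "clean", state 1 shifts the mean by +5.
transforms = [(1.0, 0.0), (1.0, 5.0)]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
frames = [0.1, -0.2, 5.1, 4.8, 0.0]
path = viterbi_environment(frames, 0.0, 1.0, transforms, log_init, log_trans)
print(path)
```

In a full recognizer the environment chain would be decoded jointly with the speech HMM states, and the transforms would be re-estimated from data; this sketch isolates only the environment-tracking step.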
