In this paper we present a dereverberation algorithm that improves automatic speech recognition (ASR) results with minimal CPU overhead. Because the reverberation tail hurts ASR the most, late reverberation is reduced via gain-based spectral subtraction. We use a multi-band decay model with an efficient method for updating it in real time. In reverberant environments the multi-channel version of the proposed algorithm reduces word error rates (WER) by up to half of the gap between a microphone array alone and a close-talk microphone. The four-channel implementation requires less than 2% of the CPU power of a modern computer.

Introduction

The need to present clean sound inputs to today's speech recognition engines has fostered a large body of research into noise suppression, microphone array processing, acoustic echo cancellation, and methods for reducing the effects of acoustic reverberation.

Reducing reverberation through deconvolution (inverse filtering) is one of the most common approaches. The main problem is that the channel must be known, or very well estimated, for deconvolution to succeed. The estimation is done in the cepstral domain [1] or at the envelope level [2]. Multi-channel variants exploit the redundancy across the channel signals [3] and frequently work in the cepstral domain [4].

Blind dereverberation methods seek to estimate the input(s) to the system without explicitly computing a deconvolution or inverse filter. Most of them employ probabilistic, statistically based models [5].

Dereverberation via suppression and enhancement is similar to noise suppression. These algorithms try to suppress the reverberation, enhance the direct-path speech, or both; neither the channel nor the signal is explicitly estimated. Usual techniques are long-term cepstral mean subtraction [6], pitch enhancement [7], and LPC analysis [8], in single- or multi-channel implementations.
The most common issues with the preceding methods are slow reaction when the reverberation changes, poor robustness to noise, and high computational requirements.

Modeling and assumptions

We convolved a clean speech signal with a typical room response function, truncating the response function after a varying point, and processed the result through our ASR engine. The results are shown in Figure 1. Early reverberation has practically no effect on the ASR results, most probably due to the cepstral mean subtraction (CMS) in the ASR engine front end: CMS compensates for the constant part of the input channel response and thus removes the early reverberation. Reverberation has a noticeable effect on WER between 50 ms and RT30. In this time interval the reverberation behaves more like a non-stationary, uncorrelated decaying noise R(f):

Y(f) = X(f) + R(f)    (1)

We assume that the reverberation energy in this time interval decays exponentially and is the same at every point of the room (i.e. it is diffuse). Our decay model is frequency dependent:
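The modeling assumptions above (late reverberation as diffuse, exponentially decaying noise with a frequency-dependent decay rate, removed by gain-based spectral subtraction) can be sketched as follows. This is a minimal illustration under those stated assumptions, not the paper's implementation; the function name, the `delay_frames` lookback that skips the harmless early reflections, and the spectral floor value are our own choices.

```python
import numpy as np

def late_reverb_gain(power, rt60, hop_s, delay_frames=4, floor=0.1):
    """Per-band spectral-subtraction gains against late reverberation.

    power: (frames, bands) short-time power spectrogram Y(f)
    rt60:  (bands,) per-band reverberation time in seconds
    hop_s: frame hop in seconds

    Exponential decay model: energy falls by 60 dB over RT60, so the
    per-frame (linear-energy) decay factor is a(f) = 10**(-6*hop_s/rt60(f)).
    """
    a = 10.0 ** (-6.0 * hop_s / np.asarray(rt60, dtype=float))
    n_frames, n_bands = power.shape
    reverb = np.zeros(n_bands)               # running late-reverb energy estimate
    gains = np.ones((n_frames, n_bands))
    for n in range(n_frames):
        if n >= delay_frames:
            # decay the previous estimate and inject energy emitted
            # delay_frames ago, already attenuated by the decay model
            reverb = a * reverb + (1.0 - a) * (a ** delay_frames) * power[n - delay_frames]
        # gain-based spectral subtraction, clamped at a spectral floor
        gains[n] = np.maximum(1.0 - reverb / np.maximum(power[n], 1e-12), floor)
    return gains
```

Applying `gains` to the short-time spectra suppresses the estimated reverberation tail; the floor prevents musical-noise artifacts when the subtraction would otherwise drive a band to zero.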
[1] Henrique S. Malvar et al., "Speech dereverberation via maximum-kurtosis subband adaptive filtering," IEEE ICASSP, 2001.
[2] Athina P. Petropulu et al., "Cepstrum-based deconvolution for speech dereverberation," IEEE Trans. Speech Audio Process., 1996.
[3] Peter Kabal et al., "Reverberant speech enhancement using cepstral processing," IEEE ICASSP, 1991.
[4] A. Kondoz et al., "Analysis and improvement of a statistical model-based voice activity detector," IEEE Signal Processing Letters, 2001.
[5] Li Deng et al., "Speech Denoising and Dereverberation Using Probabilistic Models," NIPS, 2000.
[6] Henrique S. Malvar et al., "Blind deconvolution of reverberated speech signals," 2001.
[7] Nelson Morgan et al., "Double the trouble: handling noise and reverberation in far-field automatic speech recognition," INTERSPEECH, 2002.
[8] DeLiang Wang et al., "A one-microphone algorithm for reverberant speech enhancement," IEEE ICASSP, 2003.
[9] John Mourjopoulos et al., "Modelling and enhancement of reverberant speech using an envelope convolution method," IEEE ICASSP, 1983.
[10] Henrique S. Malvar et al., "Blind deconvolution of reverberated speech signals via regularization," IEEE ICASSP, 2001.