Improved Speech Presence Probabilities Using HMM-Based Inference, With Applications to Speech Enhancement and ASR

This paper presents a technique for estimating improved speech presence probabilities (SPPs) by exploiting the temporal correlation present in spectral speech data. Starting from a set of traditional SPPs, we estimate the underlying speech presence probability via statistical inference. Traditional SPPs are treated as observations of channel-specific two-state Markov models. The corresponding steady-state and transition statistics are set to capture the well-known temporal correlation of spectral speech data, and the observation statistics are modeled on the effect of additive acoustic noise on the resulting SPPs. Once the underlying models have been parameterized, improved SPPs can be estimated with standard inference techniques such as the forward or forward-backward algorithm. The two-state configuration of the underlying signal models enables low-complexity HMM-based processing that only slightly increases complexity relative to standard SPPs, which makes the proposed framework attractive for resource-constrained scenarios. The proposed SPP masks are shown to provide a significant increase in accuracy, measured by the mean pointwise Kullback-Leibler (KL) distance, relative to the state-of-the-art method of Cohen and Berdugo (“Speech enhancement for non-stationary noise environments,” Signal Processing, vol. 81, no. 11, pp. 2403-2418, 2001). When applied to soft-decision speech enhancement, the proposed SPPs improve segmental SNRs. Closer analysis reveals significantly decreased noise leakage, at the cost of increased speech distortion. When applied to automatic speech recognition (ASR), soft-decision enhancement with the proposed SPPs provides increased recognition performance relative to the method of Cohen and Berdugo.
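As a rough illustration of the inference step described above, the following sketch runs the forward algorithm on a two-state (speech absent / speech present) Markov model for a single frequency channel. It is not the authors' implementation: the function name `forward_spp`, the transition probabilities, the prior, and the triangular observation densities used in the usage lines are all assumptions chosen for illustration, since the abstract does not specify the paper's observation model or parameter values.

```python
import numpy as np

def forward_spp(lik_speech, lik_noise, p_stay_speech=0.9,
                p_stay_noise=0.9, prior_speech=0.5):
    """Filtered P(speech present at frame t | observations up to t).

    lik_speech[t], lik_noise[t]: likelihood of the traditional SPP observed
    at frame t under the "present" and "absent" states (hypothetical inputs).
    """
    lik_speech = np.asarray(lik_speech, dtype=float)
    lik_noise = np.asarray(lik_noise, dtype=float)

    # Transition matrix A[i, j] = P(state_t = j | state_{t-1} = i),
    # with state 0 = speech absent and state 1 = speech present.
    A = np.array([[p_stay_noise, 1.0 - p_stay_noise],
                  [1.0 - p_stay_speech, p_stay_speech]])

    belief = np.array([1.0 - prior_speech, prior_speech])
    out = np.empty(len(lik_speech))
    for t in range(len(lik_speech)):
        if t > 0:
            belief = belief @ A          # predict: propagate through the chain
        updated = belief * np.array([lik_noise[t], lik_speech[t]])
        total = updated.sum()
        if total > 0.0:                  # keep the prediction if both likelihoods vanish
            belief = updated / total     # update: normalise the posterior
        out[t] = belief[1]
    return out

# Usage with an assumed observation model: triangular densities on [0, 1],
# p(q | present) = 2q and p(q | absent) = 2(1 - q).
q = np.array([0.9, 0.9, 0.2, 0.9])       # traditional SPPs for one channel
improved = forward_spp(2.0 * q, 2.0 * (1.0 - q))
```

Each frame costs one 2x2 matrix-vector product plus a normalisation per channel, which is consistent with the low added complexity the abstract attributes to the two-state configuration. The forward-backward variant would add a backward pass over the same frames to produce smoothed rather than filtered probabilities.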

[1] Israel Cohen, et al. Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging, 2003, IEEE Trans. Speech Audio Process.

[2] Y. Ephraim, et al. Speech enhancement using a minimum mean square error short-time spectral amplitude estimator, 1984.

[3] Andries P. Hekstra, et al. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[4] Andrzej Drygajlo, et al. Entropy based voice activity detection in very noisy conditions, 2001, INTERSPEECH.

[5] I. Cohen. Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator, 2002, IEEE Signal Processing Letters.

[6] D. Pearce. Enabling new speech driven services for mobile devices: an overview of the proposed ETSI standard for a distributed speech recognition front-end, 1999.

[7] Constantine Kotropoulos, et al. Voice Activity Detection with Generalized Gamma Distribution, 2006, 2006 IEEE International Conference on Multimedia and Expo.

[8] Rainer Martin, et al. Improved A Posteriori Speech Presence Probability Estimation Based on a Likelihood Ratio With Fixed Priors, 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[9] Javier Ramírez, et al. Efficient voice activity detection algorithms using long-term speech information, 2004, Speech Commun.

[10] David Malah, et al. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, 1984, IEEE Trans. Acoust. Speech Signal Process.

[11] Israel Cohen, et al. Speech enhancement for non-stationary noise environments, 2001, Signal Process.

[12] David Malah, et al. Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments, 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[13] Jacob Benesty, et al. Speech Enhancement, 2010.

[14] Peter Vary, et al. Digital Speech Transmission: Enhancement, Coding and Error Concealment, 2006.

[15] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[16] Bin Chen, et al. A Laplacian-based MMSE estimator for speech enhancement, 2007, Speech Commun.

[17] R. McAulay, et al. Speech enhancement using a soft-decision noise suppression filter, 1980.

[18] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[19] David Pearce, et al. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, 2000, INTERSPEECH.

[20] Darren Pearce, et al. Enabling new speech driven services for mobile devices: An overview of the ETSI standards activities, 2000.

[21] Rainer Martin, et al. Speech Enhancement in the DFT Domain Using Laplacian Speech Priors, 2003.

[22] Rainer Martin, et al. Noise power spectral density estimation based on optimal smoothing and minimum statistics, 2001, IEEE Trans. Speech Audio Process.

[23] Wonyong Sung, et al. A statistical model-based voice activity detection, 1999, IEEE Signal Processing Letters.