Exploiting the baseband phase structure of the voiced speech for speech enhancement

The performance of traditional speech enhancement techniques such as spectral subtraction and log-Minimum Mean Squared Error Short Time Spectral Amplitude (log-MMSE STSA) estimation degrades in the presence of highly non-stationary noise such as babble noise. This is mainly due to inaccurate noise estimation during the voiced segments of the speech signal. In this paper, we propose to exploit the fine structure of the phase spectra of voiced speech in the baseband STFT domain. This phase structure is used to detect the noise-dominant frequency bins in voiced frames, and this information is in turn used to obtain a better estimate of the non-stationary noise Power Spectral Density (PSD). With this estimate, the performance of spectral subtraction and log-MMSE STSA improves overall by 0.3 and 0.2, respectively, on the Perceptual Evaluation of Speech Quality (PESQ) measure over the original algorithms when the pitch is estimated from the noisy speech. We also present a combination of these two algorithms (spectral subtraction and log-MMSE STSA) that achieves an overall PESQ improvement of 0.5 over standard log-MMSE STSA when an accurate pitch estimate is available.
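The processing chain described above (flag noise-dominant bins in a voiced frame, update the noise PSD only in those bins, then apply spectral subtraction) can be illustrated with a minimal Python sketch. It is not the method of the paper: the abstract does not specify the baseband phase criterion, so the detector below is a stand-in that simply flags bins lying far from every pitch harmonic. All function names, the 40 Hz half-width, the smoothing factor and the spectral floor are assumptions chosen for illustration.

```python
import numpy as np


def noise_dominant_bins(f0_hz, fs, n_fft, half_width_hz=40.0):
    """Stand-in detector (assumption): flag bins far from every pitch harmonic.

    The paper uses the baseband STFT phase structure instead; this distance
    rule only plays the same role in the pipeline.
    """
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    # Distance from each bin centre to the nearest integer multiple of f0.
    dist = np.abs(freqs - np.round(freqs / f0_hz) * f0_hz)
    return dist > half_width_hz


def update_noise_psd(noise_psd, frame_power, mask, alpha=0.9):
    """Recursively smooth the noise PSD, only in bins flagged by `mask`."""
    updated = noise_psd.copy()
    updated[mask] = alpha * noise_psd[mask] + (1.0 - alpha) * frame_power[mask]
    return updated


def spectral_subtraction_gain(frame_power, noise_psd, floor=0.05):
    """Power spectral subtraction gain, limited from below by a spectral floor."""
    gain_sq = 1.0 - noise_psd / np.maximum(frame_power, 1e-12)
    return np.sqrt(np.maximum(gain_sq, floor ** 2))


if __name__ == "__main__":
    fs, n_fft, f0 = 8000, 256, 200.0           # hypothetical frame settings
    mask = noise_dominant_bins(f0, fs, n_fft)
    noise_psd = np.ones(n_fft // 2 + 1)        # previous noise estimate
    frame_power = np.full(n_fft // 2 + 1, 3.0) # |STFT|^2 of the current frame
    noise_psd = update_noise_psd(noise_psd, frame_power, mask)
    gain = spectral_subtraction_gain(frame_power, noise_psd)
    print(mask.sum(), gain.min(), gain.max())
```

In this sketch the noise estimate keeps tracking the noise between harmonics even while speech is present, which is the property the abstract relies on for non-stationary noise such as babble. The resulting gain would be applied to the noisy STFT magnitude before resynthesis with the noisy phase.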
