Speech Spectral Envelope Enhancement by HMM-Based Analysis/Resynthesis

We propose a speech enhancement-by-resynthesis framework whose strength lies in a single statistical speech model shared by the analysis and synthesis stages. First, a spectro-temporal analysis is performed, and the spectro-temporal regions masked by noise are identified using a noise model. HMM synthesis then reconstructs the spectral envelope in the masked regions conditioned on the reliable regions, which prevents the resynthesis from regressing to the training data mean. As a demonstration, we enhance noise-corrupted speech utterances from a small-vocabulary corpus for which good statistical models are available. Perceptual evaluation of speech quality (PESQ) scores and log spectral distances show considerable improvements over baseline approaches that do not exploit strong speech knowledge. The letter is accompanied by audio examples.
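The idea of reconstructing masked regions conditioned on the reliable ones can be illustrated with a minimal sketch. It is not the method of the letter: a single Gaussian over one log-spectral frame stands in for an HMM state, dynamic features and the state sequence are ignored, and the mask is a simple local-SNR threshold against a given noise estimate. The function names `reliability_mask` and `reconstruct_masked`, the threshold parameter, and all numerical values are assumptions made for illustration.

```python
import numpy as np


def reliability_mask(noisy_log_spec, noise_log_spec, snr_threshold_db=0.0):
    """Return True where the observation exceeds the noise estimate by the threshold.

    Both inputs are log-spectra in dB; their difference is used as a crude
    local SNR estimate (an assumption of this sketch).
    """
    noisy = np.asarray(noisy_log_spec, dtype=float)
    noise = np.asarray(noise_log_spec, dtype=float)
    return (noisy - noise) > snr_threshold_db


def reconstruct_masked(x, reliable, mean, cov):
    """Replace masked entries of x by their conditional Gaussian mean.

    The masked part m is estimated from the reliable part r as
    mean_m + cov_mr @ inv(cov_rr) @ (x_r - mean_r).
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    r = np.asarray(reliable, dtype=bool)
    m = ~r
    out = x.copy()
    if not m.any():
        return out                      # nothing to reconstruct
    if not r.any():
        out[m] = mean[m]                # no evidence: fall back to the model mean
        return out
    cov_rr = cov[np.ix_(r, r)]
    cov_mr = cov[np.ix_(m, r)]
    out[m] = mean[m] + cov_mr @ np.linalg.solve(cov_rr, x[r] - mean[r])
    return out


# Toy three-band frame: band 1 is masked and strongly correlated with band 0.
mean = np.zeros(3)
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
mask = reliability_mask([10.0, 3.0, 8.0], [2.0, 4.0, 2.0], snr_threshold_db=3.0)
frame = reconstruct_masked([2.0, -5.0, 1.0], mask, mean, cov)
```

In this toy case the masked band is reconstructed as 1.6 rather than the model mean of 0, because the correlated reliable band lies above its own mean. This is the sense in which conditioning on reliable regions keeps the estimate from collapsing to the training data mean.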
