A sparse representation approach for perceptual quality improvement of separated speech

Speech separation based on time-frequency masking has been shown to improve the intelligibility of speech signals corrupted by noise, but a perceived weakness of binary masking is the quality of the separated speech. In this paper, an approach for improving the perceptual quality of speech separated by binary masking is proposed. Our approach consists of two stages: in the first stage, a binary mask is generated that performs the speech separation; in the second stage, a sparse-representation approach represents the separated signal as a linear combination of short-time Fourier transform (STFT) magnitudes drawn from a clean-speech dictionary. Overlap-add synthesis is then used to generate an estimate of the speech signal. The performance of the proposed approach is evaluated with the Perceptual Evaluation of Speech Quality (PESQ), a standard objective speech quality measure. The proposed algorithm offers considerable improvements in speech quality over binary-masked noisy speech and other reconstruction approaches.
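The second stage described above can be sketched in miniature. The toy code below is an illustration only, not the paper's implementation: it uses a random nonnegative dictionary in place of one learned from a clean-speech corpus, a single magnitude frame in place of a full STFT, and a simple greedy sparse coder (orthogonal matching pursuit restricted to the mask's reliable bins). The full frame is then reconstructed from the sparse code, filling in the bins the binary mask removed; in the actual system this per-frame estimate would be combined with the noisy phase and resynthesized by overlap-add.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dictionary D of clean-speech STFT magnitude frames
# (unit-norm nonnegative columns). In the paper this would be built
# from a clean-speech corpus; here it is random for illustration.
n_bins, n_atoms, k = 64, 40, 4
D = np.abs(rng.standard_normal((n_bins, n_atoms)))
D /= np.linalg.norm(D, axis=0)

# A magnitude frame that truly is a sparse combination of two atoms,
# observed through a binary mask (True = reliable bin, False = removed).
x_true = D[:, [3, 17]] @ np.array([1.5, 0.8])
mask = rng.random(n_bins) > 0.25

def omp_masked(D, y, mask, k):
    """Greedy sparse coding (orthogonal matching pursuit) that fits
    the dictionary using only the reliable bins selected by the mask."""
    Dm, ym = D[mask], y[mask]
    support, coef, residual = [], np.zeros(0), ym.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(Dm.T @ residual)))
        if j in support:          # no new atom helps; stop early
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(Dm[:, support], ym, rcond=None)
        residual = ym - Dm[:, support] @ coef
    return support, coef, residual

support, coef, residual = omp_masked(D, x_true, mask, k)

# Reconstruct the *full* frame from the sparse code, imputing the
# masked-out bins -- the quality-improvement step of the second stage.
x_hat = D[:, support] @ coef
print("residual on reliable bins:", np.linalg.norm(residual))
```

The key point the sketch illustrates is that the fit is performed only over the time-frequency units the binary mask deems reliable, while the dictionary supplies plausible energy for the units the mask zeroed out, which is what reduces the artifacts of hard binary masking.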
