Semi-supervised Speech Enhancement in Modulation Subspace

Previous studies show that existing speech enhancement algorithms can improve speech quality but not speech intelligibility. In this study, we propose a modulation subspace (MS) based speech enhancement framework in which the spectrogram of noisy speech is decomposed into the product of a spectral envelope subspace and a spectral details subspace. This decoupling makes it possible to target specifically the noise components that most degrade intelligibility. Two supervised low-rank and sparse decomposition schemes are developed in the spectral envelope subspace to obtain a robust recovery of the speech components. A Bayesian formulation of non-negative matrix factorization (NMF) is used to learn the speech dictionary from the spectral envelope subspace of clean speech samples. In the spectral details subspace, standard robust principal component analysis (RPCA) is applied to extract the speech components. Validation results show that, compared with four state-of-the-art speech enhancement algorithms (MMSE-SPP, NMF-RPCA, RPCA, and LARC), both proposed MS based algorithms achieve higher perceptual quality and also improve speech intelligibility more effectively.
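
As a rough illustration of the envelope/details decoupling described above, the following Python sketch splits a magnitude spectrogram into a smooth envelope and a fine-structure residual whose product reproduces the input. It assumes cepstral liftering as the splitting method, which the abstract does not specify; the function name `split_envelope_details` and the parameters `n_lifter` and `eps` are hypothetical and chosen only for this example.

```python
import numpy as np

def split_envelope_details(mag, n_lifter=30, eps=1e-10):
    """Split a magnitude spectrogram into envelope * details.

    mag: array of shape (freq_bins, frames), freq_bins = n_fft // 2 + 1.
    n_lifter: number of low-quefrency coefficients kept for the envelope
              (an assumed, illustrative value).
    """
    n_bins = mag.shape[0]
    n_fft = 2 * (n_bins - 1)

    # Log magnitude turns the product envelope * details into a sum.
    log_mag = np.log(mag + eps)

    # Real cepstrum of each frame (inverse FFT along the frequency axis).
    cep = np.fft.irfft(log_mag, n=n_fft, axis=0)

    # Symmetric low-quefrency lifter: keep the slowly varying part only.
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    if n_lifter > 1:
        lifter[-(n_lifter - 1):] = 1.0

    # Back to the log-spectral domain; the liftered cepstrum is symmetric,
    # so the imaginary part is only numerical noise.
    log_env = np.fft.rfft(cep * lifter[:, None], axis=0).real

    envelope = np.exp(log_env)
    # Details are defined as the residual, so envelope * details == mag + eps.
    details = (mag + eps) / envelope
    return envelope, details
```

Each factor can then be processed separately (for example with a low-rank and sparse decomposition) and the two results multiplied back together. The choice of `n_lifter` controls how much spectral fine structure leaks into the envelope and would need tuning for real speech.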
