Paper Information - Semi-supervised Speech Enhancement in Modulation Subspace


Previous studies show that existing speech enhancement algorithms can improve speech quality but not speech intelligibility. In this study, we propose a modulation subspace (MS) based speech enhancement framework, in which the spectrogram of noisy speech is decoupled into the product of a spectral envelope subspace and a spectral details subspace. This decoupling makes it possible to specifically target the noise components that most strongly degrade intelligibility. Two supervised low-rank and sparse decomposition schemes are developed in the spectral envelope subspace to obtain a robust recovery of speech components. A Bayesian formulation of non-negative matrix factorization (NMF) is used to learn the speech dictionary from the spectral envelope subspace of clean speech samples. In the spectral details subspace, standard robust principal component analysis (RPCA) is applied to extract the speech components. The validation results show that, compared with four state-of-the-art speech enhancement algorithms (MMSE-SPP, NMF-RPCA, RPCA, and LARC), both proposed MS-based algorithms achieve higher perceptual quality and are also superior in improving speech intelligibility.
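
The abstract does not say how the spectrogram is decoupled into an envelope factor and a details factor. As a rough illustration only, the sketch below assumes a cepstral liftering approach: the low-quefrency part of the real cepstrum is taken as the spectral envelope, and the remainder becomes the spectral details, so that their element-wise product reproduces the magnitude spectrogram. The function name `split_envelope_details` and the parameters `n_lifter` and `eps` are hypothetical and are not taken from the paper.

```python
import numpy as np

def split_envelope_details(mag, n_lifter=20, eps=1e-10):
    """Split a magnitude spectrogram into envelope * details.

    mag: array of shape (n_bins, n_frames), one-sided STFT magnitudes
         with n_bins = n_fft // 2 + 1.
    n_lifter: number of low-quefrency coefficients kept for the envelope
              (an assumed tuning parameter, not from the paper).
    """
    n_bins = mag.shape[0]
    n_fft = 2 * (n_bins - 1)
    log_mag = np.log(mag + eps)

    # Real cepstrum of each frame: inverse FFT of the log magnitude.
    cep = np.fft.irfft(log_mag, n=n_fft, axis=0)

    # Symmetric low-pass lifter that keeps only low quefrencies.
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    if n_lifter > 1:
        lifter[-(n_lifter - 1):] = 1.0

    # Back to the log-spectral domain: a smooth log envelope.
    log_env = np.fft.rfft(cep * lifter[:, None], axis=0).real
    envelope = np.exp(log_env)

    # Details are whatever the envelope does not explain.
    details = (mag + eps) / envelope
    return envelope, details

# Usage on a random stand-in for a noisy magnitude spectrogram.
rng = np.random.default_rng(0)
mag = np.abs(rng.standard_normal((129, 10))) + 0.1
envelope, details = split_envelope_details(mag)
```

Under this assumption, the two factors could then be processed separately (for example, a dictionary-based decomposition on `envelope` and RPCA on `details`) and multiplied back together to form the enhanced spectrogram.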

Jun Qin | Pengfei Sun
