Compressive speech enhancement in the modulation domain

Abstract: Compressive speech enhancement (CSE) has gained popularity in recent years because it bypasses the need for noise estimation. In parallel, the modulation domain has been widely studied in speech applications, as it offers a more compact representation and is closely associated with enhancing speech intelligibility. Motivated by these developments, this paper explores the suitability of modulation-domain sparse reconstruction for CSE. The main idea is to study whether the increased sparsity in the modulation domain benefits sparse reconstruction in CSE. The findings reveal that the modulation transformation yields a sparser representation and a stronger restricted isometry property (RIP) than the frequency transformation; the RIP is essential for sparse recovery with high probability. The results are then extended to show that the sparse reconstruction error in the modulation domain is upper bounded by that in the frequency domain. Experimental results in a CSE setting agree with the theoretical derivations: modulation-domain CSE outperforms frequency-domain CSE across different speech quality measures.
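
For reference, the RIP mentioned in the abstract is usually stated as below. This is the standard definition from the compressed sensing literature, and the symbols A (sensing matrix), x (signal), s (sparsity level) and \delta_s (restricted isometry constant) are notation assumed here, not taken from the abstract. A matrix A is said to satisfy the RIP of order s if there is a constant \delta_s in (0, 1) such that:

```latex
(1 - \delta_s)\,\|x\|_2^2 \;\le\; \|A x\|_2^2 \;\le\; (1 + \delta_s)\,\|x\|_2^2
\quad \text{for all } s\text{-sparse } x .
```

Under this reading, a "stronger" RIP would presumably correspond to a smaller \delta_s, meaning A acts more like an isometry on sparse vectors. That interpretation is an assumption; the abstract does not say how strength is measured.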
