Single-channel speech enhancement with correlated spectral components: Limits-potential

Abstract: In this paper, we investigate single-channel speech enhancement algorithms that operate in the short-time Fourier transform domain and take into account dependencies with respect to frequency. Once inter-frequency dependencies are allowed, the minimum mean square error optimal estimates of the short-time Fourier transform expansion coefficients are, in general, functions of complex-valued covariance matrices. These covariance matrices are not known a priori and have to be estimated from the observed data. This work analyzes how this estimation affects the respective single-channel speech enhancement algorithms. We propose a statistical model that circumvents the need to estimate complex-valued second-order statistics, and we derive a linear multidimensional short-time spectral amplitude estimator motivated by the assumptions of this model. Further, we provide empirical evidence for these assumptions. We evaluate the potential of taking inter-frequency dependencies into account for single-channel speech enhancement and then compare the estimator resulting from the proposed statistical model to relevant benchmark methods. The results indicate that estimators that consider inter-frequency dependencies can push the limits of standard approaches in terms of joint speech quality and intelligibility improvement when the second-order statistics are estimated from isolated speech data. The proposed linear multidimensional short-time spectral amplitude estimator preserves this trend in fully blind scenarios.
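To make the role of the covariance matrices concrete, the following is a minimal sketch of a generic linear MMSE (multidimensional Wiener) estimator for a vector of complex STFT coefficients across frequency. It is not the estimator proposed in the paper, which operates on spectral amplitudes. It assumes zero-mean speech and noise that are additive and mutually uncorrelated; the names `lmmse_estimate`, `cov_speech`, and `cov_noise`, as well as the random test data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def lmmse_estimate(y, cov_speech, cov_noise):
    """Linear MMSE estimate of x from y = x + d (assumed zero-mean, uncorrelated x and d)."""
    cov_noisy = cov_speech + cov_noise
    # x_hat = C_x (C_x + C_d)^{-1} y, computed with a linear solve instead of an explicit inverse
    return cov_speech @ np.linalg.solve(cov_noisy, y)


rng = np.random.default_rng(0)
num_bins = 4  # hypothetical number of jointly processed frequency bins

# Hypothetical complex-valued, Hermitian positive semidefinite speech covariance
a = rng.standard_normal((num_bins, num_bins)) + 1j * rng.standard_normal((num_bins, num_bins))
cov_speech = a @ a.conj().T
noise_power = 0.5
cov_noise = noise_power * np.eye(num_bins)  # assumed white noise across bins

# Hypothetical noisy observation vector for one frame
y = rng.standard_normal(num_bins) + 1j * rng.standard_normal(num_bins)

# Estimate that uses the full inter-frequency covariance
x_full = lmmse_estimate(y, cov_speech, cov_noise)

# Estimate under the usual independence assumption: keep only the diagonal
x_diag = lmmse_estimate(y, np.diag(np.diag(cov_speech)), cov_noise)

# With diagonal covariances the estimator reduces to a per-bin Wiener gain
speech_power = np.diag(cov_speech).real
gain = speech_power / (speech_power + noise_power)
```

In this sketch the off-diagonal entries of `cov_speech` are complex-valued, which is exactly the kind of second-order statistic that would have to be estimated from noisy data in a blind setting; dropping them recovers the conventional per-bin gain.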
