Phase Processing for Single-Channel Speech Enhancement: History and Recent Advances

With the advancement of technology, both assisted listening devices and speech communication devices are becoming more portable and more frequently used. As a consequence, users of devices such as hearing aids, cochlear implants, and mobile telephones expect their devices to work robustly anywhere and at any time. This holds in particular for challenging noisy environments like a cafeteria, a restaurant, a subway, a factory, or traffic. One way to make assisted listening devices robust to noise is to apply speech enhancement algorithms. To improve the corrupted speech, spatial diversity can be exploited by a constructive combination of microphone signals (so-called beamforming), and the different spectro-temporal properties of speech and noise can be exploited as well. Here, we focus on single-channel speech enhancement algorithms, which rely on spectro-temporal properties. On the one hand, these algorithms can be employed when the miniaturization of devices only allows for using a single microphone. On the other hand, when multiple microphones are available, single-channel algorithms can be employed as a postprocessor at the output of a beamformer. To exploit the short-term stationarity of natural sounds, many of these approaches process the signal in a time-frequency representation, most frequently the short-time discrete Fourier transform (STFT) domain. In this domain, the coefficients of the signal are complex-valued and can therefore be represented by their absolute value (referred to in the literature as both STFT magnitude and STFT amplitude) and their phase. While the modeling and processing of the STFT magnitude has been the center of interest for the past three decades, the phase has been largely ignored.
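The magnitude/phase decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the frame length, hop size, window, and the 440 Hz test tone are all arbitrary choices made for the example.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform of a 1-D signal.

    Returns a (frames x frequency-bins) array of complex coefficients.
    Parameters here are illustrative defaults, not values from the paper.
    """
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * win for i in range(n_frames)]
    )
    # Real-input FFT of each windowed frame gives the complex STFT.
    return np.fft.rfft(frames, axis=1)

# Toy input: one second of a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

X = stft(x)
magnitude = np.abs(X)   # the STFT magnitude (a.k.a. STFT amplitude)
phase = np.angle(X)     # the STFT phase, in radians

# The complex coefficients are fully described by magnitude and phase:
X_rebuilt = magnitude * np.exp(1j * phase)
assert np.allclose(X, X_rebuilt)
```

Magnitude-centric enhancement methods modify `magnitude` and reuse the noisy `phase` at synthesis; phase-aware methods, the subject of this article, additionally estimate or reconstruct `phase`.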
