Probabilistic Inference of Speech Signals from Phaseless Spectrograms

Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers.

[1]  Brendan J. Frey,et al.  Probability Propagation and Iterative Decoding , 1996 .

[2]  Jae Lim,et al.  Signal estimation from modified short-time Fourier transform , 1984 .

[3]  R. Fletcher Practical Methods of Optimization , 1988 .

[4]  Yann LeCun,et al.  Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch , 2002, NIPS.

[5]  J. Besag On the Statistical Analysis of Dirty Pictures , 1986 .

[6]  Brendan J. Frey,et al.  Factor graphs and the sum-product algorithm , 2001, IEEE Trans. Inf. Theory.

[7]  A. Wilgus,et al.  High quality time-scale modification for speech , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  Eric A. Wan,et al.  Removal of noise from speech using the dual EKF algorithm , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[9]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[10]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine-mediated learning.