Variational Speech Separation of More Sources than Mixtures

We present a novel structured variational inference algorithm for probabilistic speech separation. The algorithm is built on a new generative probability model of speech production and mixing in the full spectral domain, which combines a detailed probability model of speech trained in the magnitude spectral domain with the position ensemble of the underlying sources as a natural, low-dimensional parameterization of the mixing process. The algorithm produces high-quality estimates of the underlying source configurations even when there are more underlying sources than available microphone recordings. Spectral phase estimates of all underlying speakers are recovered automatically, so the obtained source estimates can be transformed directly into the time domain to yield speech signals of high perceptual quality. Audio demonstrations: http://www.comm.utoronto.ca/~rennie/srcsep
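The final step mentioned above, turning a magnitude estimate plus a recovered phase estimate into a time-domain signal, can be illustrated with a minimal sketch. This is not the paper's inference algorithm; it only shows the generic recombine-and-invert step, assuming `scipy.signal.stft`/`istft` with a COLA-satisfying window, and uses a synthetic sinusoid as a stand-in source:

```python
import numpy as np
from scipy.signal import stft, istft

# Hypothetical stand-in "source" signal (one second of a 440 Hz tone).
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# Forward STFT gives the complex spectrogram of the source.
_, _, Z = stft(x, fs=fs, nperseg=256)
magnitude = np.abs(Z)   # what a magnitude-domain speech model describes
phase = np.angle(Z)     # what the inference procedure would recover

# Recombine the magnitude and phase estimates, then invert the STFT
# to obtain the time-domain waveform.
Z_hat = magnitude * np.exp(1j * phase)
_, x_hat = istft(Z_hat, fs=fs, nperseg=256)

# With exact magnitude and phase, reconstruction is essentially perfect.
reconstruction_ok = np.allclose(x, x_hat[:len(x)], atol=1e-6)
print(reconstruction_ok)
```

In the actual system, `magnitude` would come from the learned speech model and `phase` from the variational inference procedure; here both are taken from the true spectrogram purely to demonstrate the inversion.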
