Blind separation of speech mixtures via time-frequency masking

Binary time-frequency masks are powerful tools for the separation of sources from a single mixture. Perfect demixing via binary time-frequency masks is possible provided the time-frequency representations of the sources do not overlap, a condition we call W-disjoint orthogonality. We introduce here the concept of approximate W-disjoint orthogonality and present experimental results demonstrating the level of approximate W-disjoint orthogonality of speech in mixtures of various orders. The results demonstrate that there exist ideal binary time-frequency masks that can separate several speech signals from one mixture. While determining these masks blindly from just one mixture is an open problem, we show that we can approximate the ideal masks in the case where two anechoic mixtures are provided. Motivated by the maximum likelihood mixing parameter estimators, we define a power-weighted two-dimensional (2-D) histogram constructed from the ratio of the time-frequency representations of the mixtures. This histogram is shown to have one peak for each source, with the peak location corresponding to the relative attenuation and delay mixing parameters. The histogram is used to create time-frequency masks that partition one of the mixtures into the original sources. Experimental results on speech mixtures verify the technique. Example demixing results can be found online at http://alum.mit.edu/www/rickard/bss.html.
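
The pipeline described above (ratio of the two mixtures, power-weighted 2-D histogram, peak picking, binary masks) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes synthetic time-frequency data that are exactly W-disjoint orthogonal, an anechoic model X2 = a·exp(-iωδ)·X1 per source, delays small enough that the phase does not wrap, and a weighting of |X1·X2|. The function name `duet_sketch`, the bin edges, and the "largest bins" peak picker are all hypothetical choices made for illustration.

```python
import numpy as np

def duet_sketch(X1, X2, omega, a_edges, d_edges, n_sources):
    """Estimate (attenuation, delay) peaks and binary masks from two TF mixtures."""
    active = np.abs(X1) > 1e-12            # skip silent TF points
    R = np.ones_like(X1)
    R[active] = X2[active] / X1[active]    # ratio of the TF representations
    a_hat = np.abs(R)                      # relative attenuation per TF point
    d_hat = -np.angle(R) / omega[:, None]  # relative delay per TF point
    weights = np.abs(X1 * X2)              # assumed power weighting
    H, _, _ = np.histogram2d(
        a_hat[active], d_hat[active],
        bins=[a_edges, d_edges], weights=weights[active],
    )
    # Assumption: peaks are isolated, so the largest bins are the peaks.
    top = np.argsort(H, axis=None)[::-1][:n_sources]
    ia, idl = np.unravel_index(top, H.shape)
    a_mid = 0.5 * (a_edges[:-1] + a_edges[1:])
    d_mid = 0.5 * (d_edges[:-1] + d_edges[1:])
    peaks = np.column_stack([a_mid[ia], d_mid[idl]])
    # Label each TF point with its nearest peak; one binary mask per peak.
    dist = (a_hat[..., None] - peaks[:, 0]) ** 2 \
         + (d_hat[..., None] - peaks[:, 1]) ** 2
    label = np.argmin(dist, axis=-1)
    masks = [(label == j) & active for j in range(n_sources)]
    return peaks, masks

# Synthetic W-disjoint orthogonal sources: each TF point belongs to
# source 1, source 2, or neither (silence).
rng = np.random.default_rng(0)
F, T = 64, 100
omega = np.linspace(0.05, 0.8, F)          # kept low so omega * delay < pi
owner = rng.integers(0, 3, size=(F, T))
S = rng.standard_normal((2, F, T)) + 1j * rng.standard_normal((2, F, T))
S1 = S[0] * (owner == 0)
S2 = S[1] * (owner == 1)

# Two anechoic mixtures; attenuations and delays (in samples) are made up.
a_true, d_true = [0.7, 1.3], [1.0, -2.0]
X1 = S1 + S2
X2 = (a_true[0] * np.exp(-1j * omega[:, None] * d_true[0]) * S1
      + a_true[1] * np.exp(-1j * omega[:, None] * d_true[1]) * S2)

a_edges = np.linspace(0.05, 1.95, 20)      # bin centres at 0.1, 0.2, ...
d_edges = np.linspace(-3.5, 3.5, 8)        # bin centres at integer delays
peaks, masks = duet_sketch(X1, X2, omega, a_edges, d_edges, n_sources=2)
estimates = [m * X1 for m in masks]        # masked mixture = source estimate
```

Because the synthetic sources never overlap, each mask applied to the first mixture returns one source exactly. With real speech the overlap is only approximately zero, so the histogram peaks spread out and the masks only approximate the ideal ones; a smoothed histogram and a more careful peak detector would then be needed.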
