A robust method to count, locate and separate audio sources in a multichannel underdetermined mixture

We propose a method to count the sources in an underdetermined multichannel mixture and to estimate their mixing directions as well as the sources themselves. Like DUET-type methods, the approach is based on the hypothesis that the sources have time-frequency representations with limited overlap. However, instead of assuming essentially disjoint representations, we only assume that, in the neighbourhood of some time-frequency points, a single source contributes to the mixture: such time-frequency points can provide robust local estimates of the corresponding source direction. At the core of our contribution is a local confidence measure, inspired by the work of Deville on TIFROM, which detects the time-frequency regions where such robust information is available. A clustering algorithm called DEMIX is proposed to merge the information from all time-frequency regions according to their confidence levels. Two variants are proposed, for instantaneous and anechoic mixtures. In the latter case, to overcome the intrinsic ambiguities of phase unwrapping met with DUET, we propose a technique similar to GCC-PHAT to estimate time-delay parameters from phase differences between the time-frequency representations of the different channels. The resulting method is shown to be robust in conditions where all comparable DUET-like methods fail: a) when time delays largely exceed one sample; b) when the source directions are very close. For example, experiments show that, in more than 65% of the tested stereophonic mixtures of six speech sources, DEMIX-Anechoic correctly estimates the number of sources and outperforms DUET in accuracy, with a distance error ten times lower.
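The two key ingredients described above can be illustrated with a short sketch. The first function is a TIFROM-inspired confidence map: where a single source dominates a time-frequency neighbourhood, the inter-channel ratio is nearly constant, so its local variance is small; the sliding-window shape and the inverse-variance form used here are illustrative assumptions, not the paper's exact measure. The second function is the classical GCC-PHAT delay estimator of Knapp and Carter, which the paper adapts to time-frequency representations; this sketch only shows the standard time-domain variant.

```python
import numpy as np

def local_confidence(X1, X2, half_width=2):
    """TIFROM-inspired confidence map over a time-frequency grid.

    X1, X2: complex STFT matrices (frequency x time) of the two channels.
    In neighbourhoods where one source dominates, the ratio X2/X1 is
    nearly constant, so its local variance is small and the confidence
    1/(eps + var) is large. The window and formula are illustrative.
    """
    ratio = X2 / (X1 + 1e-12)
    F, T = ratio.shape
    conf = np.empty((F, T))
    for f in range(F):
        for t in range(T):
            t0, t1 = max(0, t - half_width), min(T, t + half_width + 1)
            # np.var of a complex patch returns the (real) total variance.
            conf[f, t] = 1.0 / (1e-12 + np.var(ratio[f, t0:t1]))
    return conf

def gcc_phat(x1, x2, fs=1.0):
    """Delay of x2 relative to x1 (positive when x2 lags x1), via GCC-PHAT."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    # Whitened cross-power spectrum: keep only the phase information.
    R = X2 * np.conj(X1)
    R /= np.abs(R) + 1e-12
    cc = np.fft.irfft(R, n=n)
    # Rearrange so that zero delay sits at the centre.
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

Because the PHAT weighting discards magnitude, the correlation peak stays sharp even for strongly coloured signals such as speech, which is why a similar whitening makes the delay estimate robust when delays exceed one sample.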
