Relaxing the WDO Assumption in Blind Extraction of Speakers from Speech Mixtures

The time-frequency masking approach to blind speech extraction consists of two main steps: clustering of features in the space spanned by time delay and attenuation rate, and spectrogram masking to reconstruct the sources. A binary mask is usually generated under the strong W-disjoint orthogonality (WDO) assumption, i.e., that the sources' time-frequency representations do not overlap. In practice this assumption is often violated, which degrades the quality of the reconstructed sources. In this paper we propose relaxing the WDO assumption by allowing some frequency bins to be shared by both sources. Because instantaneous fundamental frequencies are detected, mask creation is supported by exploiting the harmonic structure of speech. Experiments with both simulated and real recorded mixtures show the proposed method to be effective and reliable.

Keywords: blind source extraction, harmonic frequencies, histogram clustering, spectrogram analysis, speech reconstruction, time-frequency masking, W-disjoint orthogonality.
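
To make the two steps concrete, the following is a minimal sketch of delay/attenuation feature extraction and mask generation for a two-microphone mixture. It is not the method proposed in the paper: the function names, the Euclidean nearest-centre assignment, and the `ratio` threshold used to share a bin between sources are illustrative assumptions, and the harmonic-structure analysis described above is not reproduced. The cluster centres are assumed to be known already, for example from a histogram of the features.

```python
import numpy as np

def duet_features(X1, X2, omega, eps=1e-12):
    """Per-bin attenuation and delay estimates from two STFTs.

    X1, X2: complex arrays of shape (n_freq, n_frames).
    omega:  nonzero angular frequencies, shape (n_freq,).
    """
    ratio = (X2 + eps) / (X1 + eps)            # inter-channel ratio per bin
    attenuation = np.abs(ratio)
    delay = -np.angle(ratio) / omega[:, None]  # valid while |omega * delay| < pi
    return attenuation, delay

def binary_masks(attenuation, delay, centers):
    """Assign each bin to its nearest cluster centre (strict WDO).

    centers: array of shape (n_sources, 2) holding (attenuation, delay) pairs.
    Returns masks of shape (n_sources, n_freq, n_frames) and the distances.
    Mixing both features in one Euclidean distance is a simplification.
    """
    feats = np.stack([attenuation, delay], axis=-1)
    dist = np.linalg.norm(feats[..., None, :] - centers, axis=-1)
    nearest = dist.argmin(axis=-1)
    masks = np.stack([nearest == k for k in range(len(centers))])
    return masks, dist

def relaxed_masks(dist, ratio=0.8):
    """Hypothetical relaxation: a bin also goes to any source whose distance
    is within a factor 1/ratio of the best one, so ambiguous bins are shared."""
    best = dist.min(axis=-1, keepdims=True)
    return np.moveaxis(best >= ratio * dist, -1, 0)
```

With `ratio` close to 1 the relaxed masks approach the binary ones; smaller values let more bins be shared. A source estimate would then be obtained by multiplying one channel's spectrogram by the corresponding mask and inverting the transform.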
