Using the bi-modality of speech for convolutive frequency domain blind source separation

The problem of blind source separation (BSS) for convolutive mixtures of speech is considered. A novel algorithm is proposed that exploits the bi-modality of speech by incorporating joint audio-visual features into an existing BSS algorithm in order to improve its convergence rate. Simulations show the increase in convergence rate obtained with a joint audio-visual model compared to using raw audio data alone (i.e. no model). The difference between using a time-varying (HMM) and a stationary (GMM) statistical model of the joint audio-visual features is also examined.
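As a rough illustration of the stationary (GMM) side of the comparison, the sketch below fits a diagonal-covariance Gaussian mixture by EM to joint audio-visual feature vectors. The data, dimensions, and component count are all hypothetical stand-ins (the paper's actual features, e.g. spectral and lip-shape parameters, and its BSS integration are not reproduced here); the point is only that each frame's audio and visual features are concatenated into one joint vector before modelling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint audio-visual features: each frame concatenates an
# audio feature vector (stand-in for spectral features) with a visual
# feature vector (stand-in for lip-shape parameters), here correlated
# with the audio to mimic the bi-modality of speech.
audio = rng.normal(0.0, 1.0, size=(500, 4))
visual = 0.5 * audio[:, :2] + rng.normal(0.0, 0.3, size=(500, 2))
features = np.hstack([audio, visual])  # (frames, 6) joint vectors


def fit_gmm(x, k=2, iters=50):
    """Fit a diagonal-covariance GMM to x by EM (a stationary model)."""
    n, d = x.shape
    means = x[rng.choice(n, k, replace=False)]  # init from data points
    var = np.ones((k, d))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: per-frame responsibilities under each Gaussian
        logp = (-0.5 * (((x[:, None, :] - means) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / n
        means = (r.T @ x) / nk[:, None]
        var = (r.T @ (x ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var


w, means, var = fit_gmm(features)
print(w.sum())  # mixture weights form a distribution, so this is ~1.0
```

A time-varying (HMM) model would add transition probabilities between such Gaussian states, letting the responsibilities depend on the frame sequence rather than treating frames as independent.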
