Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

Looking at a speaker's face helps listeners hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that combines the audiovisual coherence of speech signals, estimated with statistical tools, with audio blind source separation (BSS) techniques. The algorithm is applied to the difficult and realistic case of convolutive mixtures. It operates mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture in each frequency channel. Separation is performed frequency by frequency with an audio BSS algorithm. The audio and visual information is then modeled by a newly proposed statistical model, which is used to solve the standard source permutation and scale-factor ambiguities that arise in each frequency channel after the audio blind separation stage. The proposed method is shown to be efficient for 2 x 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.

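The pipeline described in the abstract can be summarised in a short sketch. The Python fragment below is only an illustrative scaffold, not the authors' implementation: bss_per_bin and av_coherence_score are hypothetical placeholders standing in for, respectively, the per-frequency audio BSS algorithm and the statistical audiovisual coherence model of the paper, and the scale-factor correction is left out.

import numpy as np
from itertools import permutations


def stft(x, nfft=512, hop=256):
    # Naive STFT of a 1-D signal: returns an array of shape (n_frames, nfft // 2 + 1).
    win = np.hanning(nfft)
    starts = range(0, len(x) - nfft + 1, hop)
    return np.array([np.fft.rfft(x[i:i + nfft] * win) for i in starts])


def separate_convolutive(mixtures, video_features, bss_per_bin, av_coherence_score):
    # mixtures: two time-domain mixture signals of equal length (the 2 x 2 convolutive case).
    # In the STFT domain, the convolutive mixture becomes an (approximately)
    # instantaneous mixture in each frequency channel.
    X = np.stack([stft(m) for m in mixtures], axis=-1)      # shape (n_frames, n_bins, 2)
    n_frames, n_bins, n_chan = X.shape
    S = np.empty_like(X)
    for f in range(n_bins):
        # Audio-only separation of this frequency channel; the component order
        # (and scale) returned here is arbitrary, which is the ambiguity to fix.
        S_f = bss_per_bin(X[:, f, :])                        # shape (n_frames, 2)
        best_perm, best_score = None, -np.inf
        for perm in permutations(range(n_chan)):
            # Keep the ordering that is most coherent with the visual stream of
            # the target speaker, as judged by the supplied audiovisual score.
            score = av_coherence_score(S_f[:, list(perm)], video_features)
            if score > best_score:
                best_perm, best_score = perm, score
        S[:, f, :] = S_f[:, list(best_perm)]
    return S                                                 # permutation-aligned separated spectra

One permutation is chosen per frequency bin by maximising the audiovisual score, so the sketch assumes a reliable visual feature stream for the speaker of interest.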