Two-stage audio-visual speech dereverberation and separation based on models of the interaural spatial cues and spatial covariance

This work presents a two-stage speech source separation algorithm based on combined models of the interaural spatial cues and the spatial covariance, which exploit knowledge of the source locations estimated from video. In the first, pre-processing stage, the late reverberant speech components are suppressed with a spectral subtraction rule to dereverberate the observed mixture. In the second stage, the binaural spatial parameters, namely the interaural phase difference (IPD) and the interaural level difference (ILD), and the spatial covariance are modeled in the short-time Fourier transform (STFT) domain to assign individual time-frequency (TF) units to each source. The parameters of these probabilistic models and the TF regions assigned to each source are updated iteratively with the expectation-maximization (EM) algorithm. The algorithm generates TF masks that are used to reconstruct the individual speech sources. Objective results, in terms of the signal-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ), confirm that the proposed multimodal method with pre-processing is a promising approach to source separation in highly reverberant rooms.
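For concreteness, the sketch below illustrates both stages on a two-speaker, two-channel toy mixture in Python with NumPy/SciPy. It is a minimal illustration under simplifying assumptions, not the paper's implementation: stage 1 uses a common statistical late-reverberation estimate with spectral subtraction, and stage 2 runs EM over the IPD alone (the ILD and full spatial covariance models are omitted for brevity). The interaural delays `tau`, standing in for the video-estimated source directions, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft


def dereverberate(X, t60=0.6, fs=16000, hop=256, floor=0.1):
    """Stage 1 (sketch): spectral subtraction of late reverberation.

    The late-reverberant power is modelled as a delayed, exponentially
    attenuated copy of the observed power spectrum; the gain floor
    limits musical-noise artifacts.
    """
    delta = 3.0 * np.log(10) / t60             # room decay constant from T60
    n_d = int(0.05 * fs / hop)                 # ~50 ms early/late boundary, in frames
    P = np.abs(X) ** 2
    P_late = np.zeros_like(P)
    P_late[:, n_d:] = np.exp(-2.0 * delta * n_d * hop / fs) * P[:, :-n_d]
    gain = np.sqrt(np.maximum(1.0 - P_late / np.maximum(P, 1e-12), floor))
    return gain * X


def em_ipd_masks(XL, XR, freqs, tau, n_iter=20):
    """Stage 2 (sketch): EM over the interaural phase difference only.

    Source k predicts an IPD of 2*pi*f*tau[k]; the wrapped residual is
    modelled as zero-mean Gaussian. The full method additionally models
    the ILD and a spatial covariance term, omitted here.
    """
    ipd = np.angle(XL * np.conj(XR))                                # (F, T)
    pred = 2.0 * np.pi * freqs[:, None, None] * tau[None, None, :]  # (F, 1, K)
    resid = np.angle(np.exp(1j * (ipd[:, :, None] - pred)))         # wrapped, (F, T, K)
    K = len(tau)
    pi_k = np.full(K, 1.0 / K)                                      # source priors
    var = np.full(K, 1.0)                                           # residual variances
    for _ in range(n_iter):
        # E-step: posterior probability of each source at every TF unit
        logp = -0.5 * resid ** 2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(pi_k)
        logp -= logp.max(axis=2, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=2, keepdims=True)
        # M-step: re-estimate priors and residual variances
        w = post.reshape(-1, K)
        pi_k = w.mean(axis=0)
        var = np.maximum((w * resid.reshape(-1, K) ** 2).sum(axis=0)
                         / np.maximum(w.sum(axis=0), 1e-12), 1e-6)
    return post                                                     # soft TF masks, (F, T, K)


# Toy usage: a pure-delay two-channel "room", so source directions reduce to delays.
fs = 16000
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(fs), rng.standard_normal(fs)
left = s1 + np.roll(s2, 4)                   # source 2 arrives later on the left
right = np.roll(s1, 4) + s2                  # source 1 arrives later on the right
f, _, XL = stft(left, fs, nperseg=512, noverlap=256)
_, _, XR = stft(right, fs, nperseg=512, noverlap=256)
XL, XR = dereverberate(XL, fs=fs), dereverberate(XR, fs=fs)
masks = em_ipd_masks(XL, XR, f, tau=np.array([4.0 / fs, -4.0 / fs]))
_, s1_hat = istft(masks[:, :, 0] * XL, fs, nperseg=512, noverlap=256)  # masked source 1
```

The soft posteriors from the E-step serve directly as the TF masks applied before the inverse STFT; in the full method the ILD and spatial covariance likelihoods would multiply into `logp`, so the masks and model parameters are refined jointly over the EM iterations.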
