Spatial and coherence cues based time-frequency masking for binaural reverberant speech separation

Most binaural source separation algorithms consider only the dissimilarities between the recorded mixtures, such as the interaural phase and level differences (IPD, ILD), to classify and assign the time-frequency (T-F) regions of the mixture spectrograms to each source. In this paper, however, we show that the coherence between the left and right recordings provides extra information for labelling the T-F units belonging to each source. It also reduces the effect of reverberation, which consists of random reflections arriving from many directions and therefore exhibits low correlation between the sensors. Our algorithm assigns the T-F regions to the original sources based on a weighted combination of the IPD, ILD and mixing vector models together with the estimated interaural coherence (IC) between the left and right recordings. Binaural room impulse responses measured in four rooms with various acoustic conditions were used to evaluate the proposed method, which shows an average improvement of more than 2.23 dB in signal-to-distortion ratio (SDR) over state-of-the-art algorithms in room D with T60 = 0.89 s.
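To make the cues concrete, the sketch below shows one way the quantities named in the abstract (IPD, ILD and a smoothed interaural coherence) could be computed from the left and right STFTs and combined into a soft T-F mask. This is an illustrative assumption, not the paper's exact algorithm: the reference cues `ipd_ref`/`ild_ref`, the Gaussian ILD spread, and the fixed weights `w_ipd`, `w_ild`, `w_ic` are placeholders for the probabilistic models the paper actually fits.

```python
# Minimal sketch, assuming a single target direction with known reference
# IPD/ILD; the paper instead learns source models and weights statistically.
import numpy as np
from scipy.signal import stft

def binaural_cue_mask(left, right, fs, ipd_ref=0.0, ild_ref=0.0,
                      w_ipd=1.0, w_ild=1.0, w_ic=1.0, alpha=0.8):
    """Return a soft mask in [0, 1] over the T-F units of the left channel."""
    f, t, L = stft(left, fs, nperseg=1024)
    _, _, R = stft(right, fs, nperseg=1024)
    eps = 1e-12

    ipd = np.angle(L * np.conj(R))                                # interaural phase difference (rad)
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # interaural level difference (dB)

    # Interaural coherence via recursive smoothing of cross- and auto-spectra.
    def smooth(X):
        Y = np.empty_like(X)
        Y[:, 0] = X[:, 0]
        for n in range(1, X.shape[1]):
            Y[:, n] = alpha * Y[:, n - 1] + (1.0 - alpha) * X[:, n]
        return Y

    phi_lr = smooth(L * np.conj(R))
    phi_ll = smooth(np.abs(L) ** 2)
    phi_rr = smooth(np.abs(R) ** 2)
    ic = np.abs(phi_lr) / np.sqrt(phi_ll * phi_rr + eps)          # coherence in [0, 1]

    # Cue scores: closer to the reference IPD/ILD and higher coherence -> larger.
    s_ipd = 0.5 * (np.cos(ipd - ipd_ref) + 1.0)                   # 1 when phases match
    s_ild = np.exp(-0.5 * ((ild - ild_ref) / 6.0) ** 2)           # assumed 6 dB spread
    mask = (w_ipd * s_ipd + w_ild * s_ild + w_ic * ic) / (w_ipd + w_ild + w_ic)
    return mask, f, t
```

Applying the mask to the left-channel STFT and taking the inverse STFT would give a separated-source estimate; the coherence term down-weights T-F units dominated by diffuse reverberation, which is the behaviour the abstract highlights.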
