Joint Mixing Vector and Binaural Model Based Stereo Source Separation

In this paper, the mixing vector (MV) of the statistical mixing model is compared with the binaural cues represented by the interaural level and phase differences (ILD and IPD). The MV distributions are shown to remain distinct when the sources are close to each other, whereas the binaural models overlap. On the other hand, the binaural cues are more robust to high reverberation than the MV models. Motivated by this complementary behavior, we introduce a new robust algorithm for stereo speech separation, which accounts for both additive and convolutive noise when modeling the MV and binaural cues in parallel and estimating probabilistic time-frequency masks. The contribution of each cue to the final decision is adjusted by empirically weighting the log-likelihoods of the cues. Furthermore, the permutation problem of frequency-domain blind source separation (BSS) is addressed by initializing the MVs based on the binaural cues. Experiments are performed systematically on determined and underdetermined speech mixtures in five rooms with various acoustic properties, including anechoic, highly reverberant, and spatially diffuse noise conditions. The results, in terms of signal-to-distortion ratio (SDR), confirm the benefits of integrating the MV and binaural cues, as compared with two state-of-the-art baseline algorithms that use only the MV or only the binaural cues.
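The weighting of cue log-likelihoods described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_cue_masks`, the weights `w_binaural` and `w_mv`, and the array layout (sources, frequency bins, time frames) are assumptions made for illustration, and the per-source log-likelihoods are assumed to have been computed already by some binaural and MV models.

```python
import numpy as np


def combine_cue_masks(loglik_binaural, loglik_mv, w_binaural=1.0, w_mv=1.0):
    """Turn weighted per-source log-likelihoods into soft time-frequency masks.

    Both inputs are assumed to have shape (n_sources, n_freq, n_frames).
    The weights are hypothetical tuning parameters, chosen empirically.
    """
    loglik_binaural = np.asarray(loglik_binaural, dtype=float)
    loglik_mv = np.asarray(loglik_mv, dtype=float)
    if loglik_binaural.shape != loglik_mv.shape:
        raise ValueError("log-likelihood arrays must have the same shape")

    # Weighted sum of the two cues in the log domain.
    combined = w_binaural * loglik_binaural + w_mv * loglik_mv

    # Normalize across sources with a numerically stable softmax,
    # so that the masks of all sources sum to one in every T-F unit.
    combined = combined - combined.max(axis=0, keepdims=True)
    masks = np.exp(combined)
    masks /= masks.sum(axis=0, keepdims=True)
    return masks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for model log-likelihoods: 3 sources, 4 bins, 5 frames.
    ll_binaural = rng.normal(size=(3, 4, 5))
    ll_mv = rng.normal(size=(3, 4, 5))
    masks = combine_cue_masks(ll_binaural, ll_mv, w_binaural=0.7, w_mv=1.0)
    print(masks.shape, masks.sum(axis=0).round(6).min())
```

Setting one weight to zero reduces the mask to a posterior based on the other cue alone, which is one simple way to see how the weights trade the two cues off against each other.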
