Model-Based Expectation-Maximization Source Separation and Localization

This paper describes a system, referred to as model-based expectation-maximization source separation and localization (MESSL), for separating and localizing multiple sound sources from an underdetermined reverberant two-channel recording. By clustering individual spectrogram points based on their interaural phase and level differences, MESSL generates masks that can be used to isolate individual sound sources. We first describe a probabilistic model of interaural parameters that can be evaluated at individual spectrogram points. By creating a mixture of these models over sources and delays, the multi-source localization problem is reduced to a collection of single-source problems. We derive an expectation-maximization algorithm for computing the maximum-likelihood parameters of this mixture model, and show that these parameters correspond well with interaural parameters measured in isolation. As a byproduct of fitting this mixture model, the algorithm creates probabilistic spectrogram masks that can be used for source separation. In simulated anechoic and reverberant environments, separations using MESSL achieved, on average, signal-to-distortion ratios 1.6 dB higher and perceptual evaluation of speech quality (PESQ) scores 0.27 mean opinion score units higher than those of four comparable algorithms.

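To make the mixture-model formulation concrete, the sketch below implements a stripped-down, interaural-phase-only version of this EM loop in NumPy: a Gaussian model of the wrapped phase residual for each (source, delay) pair, an E-step that computes the joint posterior over sources and delays at every spectrogram point, and an M-step that re-estimates the mixing weights and residual spreads. The function name, the fixed delay grid, the random initialization, and the omission of the interaural level difference and garbage-source models are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def messl_ipd_sketch(L, R, sr, n_src=2, n_delays=31, max_delay_ms=0.75,
                     n_iter=10, sigma0=1.0, seed=0):
    """L, R: complex STFTs (freq bins x frames) of the left and right channels.
    Returns probabilistic masks, one per source, of shape (n_src, F, T)."""
    F, T = L.shape
    # Bin center frequencies in rad/s, assuming F = nfft/2 + 1.
    omega = 2 * np.pi * np.arange(F) * sr / (2 * (F - 1))
    ipd = np.angle(L * np.conj(R))          # observed interaural phase difference
    taus = np.linspace(-max_delay_ms, max_delay_ms, n_delays) * 1e-3

    # Wrapped residual between the observed IPD and the IPD predicted by a pure delay tau.
    resid = np.angle(np.exp(1j * (ipd[None] - omega[None, :, None] * taus[:, None, None])))

    # Parameters: prior over (source, delay) pairs and a per-source residual spread.
    # A random initialization breaks the symmetry between sources; the full system
    # instead seeds the delays from cross-correlation.
    rng = np.random.default_rng(seed)
    psi = rng.dirichlet(np.ones(n_delays), size=n_src) / n_src
    sigma = np.full(n_src, sigma0)

    for _ in range(n_iter):
        # E-step: joint posterior over (source, delay) at every spectrogram point.
        ll = (-0.5 * (resid[None] / sigma[:, None, None, None]) ** 2
              - np.log(sigma)[:, None, None, None]
              + np.log(psi)[:, :, None, None])
        post = np.exp(ll - ll.max(axis=(0, 1), keepdims=True))
        post /= post.sum(axis=(0, 1), keepdims=True)

        # M-step: re-estimate the mixing weights and residual spreads.
        psi = post.sum(axis=(2, 3)) / (F * T)
        sigma = np.sqrt((post * resid[None] ** 2).sum(axis=(1, 2, 3))
                        / (post.sum(axis=(1, 2, 3)) + 1e-12))

    # Marginalizing the posterior over delay gives each source's spectrogram mask.
    return post.sum(axis=1)
```

Applying one of the returned masks to the left-channel STFT and inverting the transform yields an estimate of the corresponding source.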