The Ideal Interaural Parameter Mask: A bound on binaural separation systems

We introduce the Ideal Interaural Parameter Mask (IIPM) as an upper bound on the performance of mask-based source separation algorithms that rely on the differences between the signals at two microphones or ears. With two additions to our Model-based EM Source Separation and Localization (MESSL) system, its performance comes to within 0.9 dB of the IIPM upper bound. These additions counter the effects of reverberation by absorbing reverberant energy and by forcing the interaural level difference (ILD) estimate to be larger than it might otherwise be. We also added an oracle reliability measure, expecting that estimating parameters from more reliable regions of the spectrogram would improve separation, but it did not help consistently.
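
As a rough illustration of the interaural cues that such mask-based systems work from, the sketch below computes a level difference and a phase difference at each time-frequency point of a two-channel spectrogram and then applies a mask to one channel. It is not the IIPM or the MESSL model: the function names, the toy spectrogram values, and the simple "level difference above 0 dB" mask rule are all assumptions made for this example.

```python
import cmath
import math


def interaural_parameters(left, right, eps=1e-12):
    """Return (level difference in dB, phase difference in radians)
    for one complex time-frequency point per channel.

    eps is a hypothetical guard against log(0) and division by zero.
    """
    ild_db = 20.0 * math.log10((abs(left) + eps) / (abs(right) + eps))
    # Phase of left relative to right, wrapped to (-pi, pi].
    ipd = cmath.phase(left * right.conjugate())
    return ild_db, ipd


def apply_mask(mask, spectrogram):
    """Multiply a spectrogram by a mask of the same shape, point by point."""
    return [
        [m * x for m, x in zip(mask_row, spec_row)]
        for mask_row, spec_row in zip(mask, spectrogram)
    ]


# Toy 2x2 "spectrogram" (assumed values). The right channel is the left
# channel at half the amplitude and delayed in phase by 0.3 rad.
left = [[1 + 1j, 0.5 - 2j], [-1.5 + 0.2j, 0.3 + 0.4j]]
right = [[0.5 * x * cmath.exp(-0.3j) for x in row] for row in left]

params = [
    [interaural_parameters(l, r) for l, r in zip(left_row, right_row)]
    for left_row, right_row in zip(left, right)
]

# Illustrative binary mask: keep points where the left channel is louder.
mask = [[1.0 if ild > 0.0 else 0.0 for ild, _ in row] for row in params]
masked_left = apply_mask(mask, left)
```

In this toy case every point has the same level and phase difference, so the mask keeps everything. With a real mixture, the parameters vary across the spectrogram and the mask selects the points attributed to one source.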
