Binaural Detection, Localization, and Segregation in Reverberant Environments Based on Joint Pitch and Azimuth Cues

We propose an approach to binaural detection, localization and segregation of speech based on pitch and azimuth cues. We formulate the problem as a search through a multisource state space across time, where each multisource state encodes the number of active sources, and the azimuth and pitch of each active source. A set of multilayer perceptrons is trained to assign time-frequency units to one of the active sources in each multisource state based jointly on observed pitch and azimuth cues. We develop a novel hidden Markov model framework to estimate the most probable path through the multisource state space. An estimated state path encodes a solution to the detection, localization, pitch estimation and simultaneous organization problems. Segregation is then achieved with an azimuth-based sequential organization stage. We demonstrate that the proposed framework improves segregation relative to several two-microphone comparison systems that are based solely on azimuth cues. Performance gains are consistent across a variety of reverberant conditions.
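To illustrate what a most-probable-path search over a multisource state space can look like, the following is a minimal sketch using standard Viterbi decoding. It is not the framework proposed here: the three toy states, the "sticky" transition probabilities, and the per-frame likelihoods are all made-up assumptions, and in the actual system the frame likelihoods would come from the pitch and azimuth cues rather than a hand-written table.

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return the most probable sequence of state indices.

    log_init[s]     : log prior of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_obs[t][s]   : log likelihood of the frame-t observation under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_obs[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s in the previous frame.
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical toy state space: each state is a tuple of
# (azimuth in degrees, pitch in Hz) pairs, one pair per active source.
STATES = [
    (),                        # no active source
    ((-30, 120),),             # one active source
    ((-30, 120), (30, 200)),   # two active sources
]

def sticky_log_trans(n, stay=0.8):
    """Assumed transition model: states tend to persist between frames."""
    move = (1.0 - stay) / (n - 1)
    return [[math.log(stay if r == s else move) for s in range(n)]
            for r in range(n)]

# Made-up frame likelihoods (rows: frames, columns: states).
OBS = [
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.05, 0.05, 0.90],
]

log_init = [math.log(1.0 / len(STATES))] * len(STATES)
log_trans = sticky_log_trans(len(STATES))
log_obs = [[math.log(p) for p in frame] for frame in OBS]

path = viterbi(log_init, log_trans, log_obs)
decoded = [STATES[i] for i in path]
source_counts = [len(state) for state in decoded]
```

In this toy setting, the decoded path gives, for each frame, the number of active sources (the length of the state tuple) together with an azimuth and pitch for each, which mirrors how a single state path can answer the detection, localization, and pitch estimation questions at once.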
