Computational auditory scene analysis by using statistics of high-dimensional speech dynamics and sound source direction

A main task for computational auditory scene analysis (CASA) is to separate several concurrent speech sources. From psychoacoustics it is known that common onsets, common amplitude modulation, and sound source direction are among the important cues that enable this separation in the human auditory system. A new algorithm is presented here that performs statistical estimation of the different sources via a state-space approach integrating temporal and frequency-specific features of speech. It is based on a Sequential Monte Carlo (SMC) scheme and tracks magnitude spectra and direction on a frame-by-frame basis. First results for estimating sound source direction and separating the envelopes of two voices are shown. The results indicate that the algorithm is able to localize two superimposed sound sources on a time scale of 50 ms. This is achieved by integrating measured high-dimensional statistics of speech. The algorithm is also able to track the short-time envelopes and the short-time magnitude spectra of both voices on a time scale of 10–40 ms. The algorithm presented in this paper is developed for, but not restricted to, binaural hearing aid applications, as it is based on two head-mounted microphone signals as input. It is conceptually able to separate more than two voices and to integrate additional cues.
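To illustrate the kind of frame-by-frame Sequential Monte Carlo tracking described above, the following is a minimal sketch of a bootstrap particle filter estimating a single scalar state (here, a source azimuth in degrees). It is a generic illustration, not the paper's algorithm: the random-walk transition model, the Gaussian likelihood, and all parameter values (`n_particles`, `proc_std`, `obs_std`) are assumptions chosen for clarity, whereas the paper's state would include high-dimensional magnitude spectra and speech statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_track(observations, n_particles=500, proc_std=5.0, obs_std=10.0):
    """Bootstrap particle filter: predict, weight, resample per frame.

    Tracks a scalar hidden state (e.g. a source azimuth in degrees).
    All model choices here are illustrative, not taken from the paper.
    """
    # initial prior: azimuth uniform over the frontal half-plane
    particles = rng.uniform(-90.0, 90.0, n_particles)
    estimates = []
    for z in observations:
        # predict: random-walk state transition (assumed model)
        particles = particles + rng.normal(0.0, proc_std, n_particles)
        # update: Gaussian observation likelihood (assumed model)
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        w /= w.sum()
        # posterior-mean estimate for this frame
        estimates.append(float(np.sum(w * particles)))
        # resample to avoid weight degeneracy (multinomial, for simplicity)
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return estimates

# noisy direction observations of a source fixed at 30 degrees
obs = 30.0 + rng.normal(0.0, 10.0, 50)
track = smc_track(obs)
```

The per-frame predict/weight/resample loop is the essence of any SMC scheme; the paper's contribution lies in the high-dimensional state and the measured speech statistics used as priors, which this toy scalar example omits.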