Sound source separation algorithm using phase difference and angle distribution modeling near the target

In this paper we present a novel two-microphone sound source separation algorithm that selects the signal from the target direction while suppressing signals from other directions. In this algorithm, referred to as Power Angle Information Near Target (PAINT), we first calculate the phase difference between the two microphone signals for each time-frequency bin. From the phase difference, the angle of the sound source is estimated. For each frame, we represent the distribution of source angles near the expected target location as a mixture of a Gaussian distribution and a uniform distribution, and obtain binary masks using hypothesis testing. Continuous masks are calculated from the binary masks using the Channel Weighting (CW) technique, and the processed speech is synthesized using the IFFT and the OverLap-Add (OLA) method. We demonstrate that the algorithm described in this paper achieves higher speech recognition accuracy than conventional approaches and our previous approaches.
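The first stages of the pipeline described above (phase difference, angle estimate, binary mask) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a far-field source, a speed of sound of 343 m/s, and a simple likelihood comparison with fixed parameters in place of the per-frame mixture modeling. The function names `estimate_angle` and `binary_mask` and the parameters `sigma`, `window`, and `prior_target` are hypothetical and chosen only for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def estimate_angle(x_left, x_right, freq_hz, mic_distance_m):
    """Estimate the source angle (radians) for one time-frequency bin.

    x_left, x_right: complex STFT coefficients of the two microphones.
    Assumes a far-field model: phase_diff = 2*pi*f * d * sin(theta) / c.
    Sign convention (assumed): positive angle means the right channel lags.
    """
    # Phase of x_left * conj(x_right) is the phase difference in (-pi, pi].
    phase_diff = cmath.phase(x_left * x_right.conjugate())
    sin_theta = phase_diff * SPEED_OF_SOUND / (
        2.0 * math.pi * freq_hz * mic_distance_m
    )
    # Clip to the valid range of arcsin to tolerate noise.
    sin_theta = max(-1.0, min(1.0, sin_theta))
    return math.asin(sin_theta)


def binary_mask(theta, target_theta, sigma, window, prior_target=0.5):
    """Return 1 if the bin is attributed to the target, else 0.

    Illustrative test only: within +/- window of the target angle, compare
    a Gaussian likelihood (target) against a uniform likelihood
    (interference). Here sigma and prior_target are fixed inputs.
    """
    if abs(theta - target_theta) > window:
        return 0
    gauss = math.exp(-0.5 * ((theta - target_theta) / sigma) ** 2) / (
        sigma * math.sqrt(2.0 * math.pi)
    )
    uniform = 1.0 / (2.0 * window)
    return 1 if prior_target * gauss > (1.0 - prior_target) * uniform else 0


# Usage: one bin at 1 kHz, microphones 4 cm apart, source at 30 degrees.
true_theta = math.radians(30.0)
delay = 0.04 * math.sin(true_theta) / SPEED_OF_SOUND
x_l = complex(1.0, 0.0)
x_r = cmath.exp(-1j * 2.0 * math.pi * 1000.0 * delay)  # right channel lags
theta_hat = estimate_angle(x_l, x_r, 1000.0, 0.04)
mask = binary_mask(theta_hat, true_theta, sigma=0.1, window=0.5)
```

In a full system these functions would be applied to every bin of a short-time Fourier transform, and the resulting binary masks would then be smoothed into continuous masks before resynthesis. Those later stages are not shown here.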
