A reverberation robust target speech detection method using dual-microphone in distant-talking scene

Abstract Speech signal processing under coherent interference in reverberant, distant-talking scenes has long been a difficult problem, and Target Speech Detection (TSD) plays a fundamental role in it. This paper proposes a reverberation-robust TSD method based on the Beam-to-Reference Ratio (BRR) using a dual-microphone array. First, detection thresholds in the Time–Frequency (T–F) domain are derived under a free sound field assumption. A novel estimator, the Direct-to-Reverberate Ratio (DRR), is then introduced to extend this assumption to reverberant environments, which are common in distant-talking scenes, and the T–F thresholds of the BRR are revised according to the DRR. In addition, an inherent weakness of compact arrays, attributed to spatial aliasing, is studied, and a sidelobe suppression procedure is proposed to further reduce the effect of coherent interference. Building on these techniques, a state-of-the-art full-band judgement is obtained from the statistics of the likelihood on each T–F block. Experimental results show that the proposed method performs robustly in different reverberant environments with coherent interference when the target speech arrives from an a priori known direction of arrival (DOA) in distant-talking scenes.
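The general idea of a BRR-based detector with a per-bin threshold and a full-band vote can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a broadside target (zero inter-microphone delay), uses a plain sum beam and a difference reference, and applies a fixed threshold instead of the derived, DRR-revised thresholds. The sidelobe suppression step is not modelled. The function names and the values `threshold_db` and `min_fraction` are hypothetical choices made for illustration only.

```python
import numpy as np

def stft(x, frame=256, hop=128):
    """Hann-windowed STFT; returns an array of shape (frames, frame // 2 + 1)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def beam_to_reference_ratio(x1, x2, eps=1e-12):
    """Per T-F bin power ratio of a sum beam to a difference reference.

    Assumption: the target is at broadside, so it reaches both
    microphones with zero relative delay.
    """
    X1, X2 = stft(x1), stft(x2)
    beam = 0.5 * (X1 + X2)       # passes the zero-delay (target) component
    reference = 0.5 * (X1 - X2)  # cancels the zero-delay (target) component
    return np.abs(beam) ** 2 / (np.abs(reference) ** 2 + eps)

def full_band_decision(brr, threshold_db=6.0, min_fraction=0.5):
    """Flag a frame as target speech when enough T-F bins exceed the threshold.

    threshold_db and min_fraction are illustrative values, not the
    thresholds derived in the paper.
    """
    brr_db = 10.0 * np.log10(brr + 1e-12)
    votes = brr_db > threshold_db          # per-bin detection
    return votes.mean(axis=1) > min_fraction  # per-frame full-band vote

# Usage with synthetic signals (white noise stands in for speech).
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
mic1 = source + 0.01 * rng.standard_normal(16000)
mic2 = source + 0.01 * rng.standard_normal(16000)
target_frames = full_band_decision(beam_to_reference_ratio(mic1, mic2))
```

In this toy setting a source that reaches the second microphone with a delay of a few samples produces a BRR that oscillates across frequency, so fewer bins pass the threshold and the full-band vote rejects it. A fixed threshold of this kind is derived for free-field conditions only, which is the limitation the DRR-based revision in the abstract is meant to address.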
