Robust digit recognition using phase-dependent time-frequency masking

A technique is proposed that uses the time-frequency phase information from two microphones to estimate an ideal time-frequency mask, based on the time delay of arrival (TDOA) of the signal of interest. At a signal-to-noise ratio (SNR) of 0 dB, the proposed two-microphone technique achieves a digit recognition rate of 71%, averaged over 5 speakers each speaking 20-30 digits. In contrast, delay-and-sum beamforming achieves only a 40% recognition rate with two microphones and 60% with four, while superdirective beamforming achieves 44% with two microphones and 65% with four.
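
The idea of a phase-dependent mask can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the masking rule, so the binary decision, the tolerance `tol_rad`, the sign convention for the delay, and the function name `phase_mask` are all assumptions made for illustration. The sketch takes one short-time Fourier transform (STFT) frame from each microphone, compares the observed inter-microphone phase difference in each frequency bin with the phase difference expected for the target TDOA, and keeps only the bins where the two agree.

```python
import cmath
import math


def phase_mask(X1, X2, freqs_hz, target_tdoa_s, tol_rad=0.5):
    """Binary time-frequency mask for one STFT frame (illustrative sketch).

    X1, X2        : complex STFT bins of microphone 1 and 2 (same length)
    freqs_hz      : centre frequency of each bin, in Hz
    target_tdoa_s : assumed delay of the target at mic 2 relative to mic 1
    tol_rad       : hypothetical phase tolerance, not taken from the paper
    """
    mask = []
    for x1, x2, f in zip(X1, X2, freqs_hz):
        # Observed phase difference between the two microphones in this bin.
        observed = cmath.phase(x1 * x2.conjugate())
        # Phase difference a source with the target TDOA would produce.
        expected = 2.0 * math.pi * f * target_tdoa_s
        # Wrap the phase error into (-pi, pi] before thresholding.
        err = cmath.phase(cmath.exp(1j * (observed - expected)))
        mask.append(1.0 if abs(err) <= tol_rad else 0.0)
    return mask


def delayed(x, f, tdoa_s):
    """Return bin x as seen at mic 2 when it arrives tdoa_s seconds later."""
    return x * cmath.exp(-2j * math.pi * f * tdoa_s)


# Toy frame: bins 0 and 2 come from the target direction, bin 1 from an
# interferer arriving with the opposite delay.
freqs = [500.0, 1000.0, 2000.0]
tdoa = 2e-4
X1 = [1 + 0j, 1 + 0j, 1 + 0j]
X2 = [delayed(X1[0], freqs[0], tdoa),
      delayed(X1[1], freqs[1], -tdoa),
      delayed(X1[2], freqs[2], tdoa)]
print(phase_mask(X1, X2, freqs, tdoa))  # [1.0, 0.0, 1.0]
```

In a full system the mask would be applied to one microphone's STFT before resynthesis or feature extraction. Note that above the spatial aliasing frequency the wrapped phase difference becomes ambiguous, which a simple sketch like this does not address.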
