Binaural localization of speech sources in 3-D using a composite feature vector of the HRTF

Binaural localization of speech sources in 3-D using head-related transfer functions (HRTFs) always suffers from elevation ambiguity because of the limited high-frequency spectral information available at the receivers. This paper presents a method that overcomes this limitation by exploiting the interaural phase and magnitude features present in the HRTF. We (i) introduce a new feature vector that combines these two sets of features in a non-linear fashion, and (ii) propose a mechanism to extract this feature vector free from distortion by the speech spectra. The proposed method is evaluated against a correlation-based HRTF database matching approach and a two-step localization technique, across multiple source positions, HRTFs (i.e., individuals), and speech inputs. The results suggest that an improvement of up to 20% in localization performance can be achieved at moderate signal-to-noise ratios.
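
As a rough illustration of why interaural features can be extracted without distortion from the speech spectrum, consider the noise-free model in which each ear signal is the source spectrum multiplied by that ear's HRTF. The source term then cancels in the left/right ratio, leaving only the interaural level difference (ILD) and interaural phase difference (IPD) of the HRTF pair. The sketch below is not the method proposed in the paper: the random toy database, the function names, and the simple additive mismatch score are all assumptions made for illustration, and the paper's actual non-linear feature combination and extraction mechanism are not reproduced here.

```python
import cmath
import math
import random


def interaural_features(left, right):
    """Return per-bin (ILD in dB, IPD in radians) from two complex spectra."""
    ild = [20.0 * math.log10(abs(l) / abs(r)) for l, r in zip(left, right)]
    ipd = [cmath.phase(l * r.conjugate()) for l, r in zip(left, right)]
    return ild, ipd


def mismatch(feat_a, feat_b):
    """Hypothetical score: squared ILD error plus a wrap-safe IPD error."""
    (ild_a, ipd_a), (ild_b, ipd_b) = feat_a, feat_b
    level_term = sum((a - b) ** 2 for a, b in zip(ild_a, ild_b))
    phase_term = sum(1.0 - math.cos(a - b) for a, b in zip(ipd_a, ipd_b))
    return level_term + phase_term


def random_spectrum(rng, n_bins):
    """Random complex spectrum with magnitudes bounded away from zero."""
    return [
        cmath.rect(rng.uniform(0.5, 2.0), rng.uniform(-math.pi, math.pi))
        for _ in range(n_bins)
    ]


def localize(left, right, database):
    """Return the index of the database HRTF pair with the smallest mismatch."""
    observed = interaural_features(left, right)
    scores = [
        mismatch(observed, interaural_features(h_l, h_r)) for h_l, h_r in database
    ]
    return min(range(len(database)), key=scores.__getitem__)


# Toy setup (assumption): 8 candidate directions, 32 frequency bins each.
rng = random.Random(0)
N_BINS, N_DIRECTIONS = 32, 8
database = [
    (random_spectrum(rng, N_BINS), random_spectrum(rng, N_BINS))
    for _ in range(N_DIRECTIONS)
]

# Simulate noise-free ear signals: ear spectrum = HRTF * source spectrum.
true_index = 5
source = random_spectrum(rng, N_BINS)
h_left, h_right = database[true_index]
ear_left = [h * s for h, s in zip(h_left, source)]
ear_right = [h * s for h, s in zip(h_right, source)]

estimate = localize(ear_left, ear_right, database)
print(estimate == true_index)
```

Because the unknown source spectrum multiplies both ear signals equally, the observed ILD and IPD match those of the true HRTF pair up to floating-point error, so the lookup recovers `true_index`. With measured HRTFs, noise, and reverberation the cancellation is only approximate, which is the regime the evaluation above addresses.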
