GCC-PHAT with Speech-oriented Attention for Robotic Sound Source Localization

Robotic audition is a basic sense that helps robots perceive their surroundings and interact with humans, and Sound Source Localization (SSL) is an essential module for a robotic system. However, the performance of most SSL techniques degrades in noisy and reverberant environments because the Time Difference of Arrival (TDoA) estimate becomes inaccurate. In robotic SSL, we are more interested in detecting the arrival of human speech than that of other sound sources. Ideally, an effective TDoA estimator should respond only to speech signals while masking out other interference. In this paper, we propose a novel technique that learns to attend to the speech fundamental frequency and its harmonics while suppressing noise and reverberation. We refer to the resulting TDoA feature as Generalized Cross Correlation with Phase Transform and Speech Mask (GCC-PHAT-SM). We perform sound source localization experiments on real-world data captured from a robotic platform. The experiments show that the GCC-PHAT-SM feature significantly outperforms the traditional Generalized Cross Correlation (GCC) feature in noisy and reverberant acoustic environments.
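
To make the idea of a masked GCC-PHAT feature concrete, the following is a minimal sketch of GCC-PHAT between two microphone signals with an optional per-frequency weighting applied to the whitened cross-spectrum. It is an illustration under stated assumptions, not the method proposed here: the function name `gcc_phat`, its parameters, and the fixed band-pass mask in the usage lines are hypothetical, and the hand-set mask only stands in for a mask that would be learned from data.

```python
import numpy as np

def gcc_phat(x, y, fs, mask=None, max_tau=None):
    """Estimate the delay of y relative to x, in seconds, using GCC-PHAT.

    mask is an optional array of per-bin weights in [0, 1] with length
    n // 2 + 1, where n = len(x) + len(y). It is a hypothetical stand-in
    for a learned speech mask.
    """
    n = len(x) + len(y)                  # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12       # PHAT: keep phase, discard magnitude
    if mask is not None:
        cross = cross * mask             # down-weight bins judged non-speech
    cc = np.fft.irfft(cross, n=n)        # generalized cross-correlation

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    lags = np.arange(-max_shift, max_shift + 1)
    cc = cc[lags]                        # negative lags wrap to the array end
    shift = int(lags[np.argmax(cc)])
    return shift / fs, cc

# Usage: y is x delayed by 5 samples at an assumed 16 kHz sampling rate.
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = np.concatenate((np.zeros(5), x[:-5]))

# Assumed placeholder mask: pass 100 Hz to 4 kHz, zero elsewhere.
freqs = np.fft.rfftfreq(len(x) + len(y), d=1.0 / fs)
band_mask = ((freqs >= 100) & (freqs <= 4000)).astype(float)

tau_plain, _ = gcc_phat(x, y, fs)
tau_masked, _ = gcc_phat(x, y, fs, mask=band_mask)
```

Applying the weights after the phase transform means each frequency bin contributes either its pure phase term or nothing, so bins dominated by noise can be excluded without reintroducing the magnitude dependence that PHAT removes.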