Speech activity detection and face orientation estimation using multiple microphone arrays and human position information

We developed a system that detects the speech activity intervals of multiple speakers by combining multiple microphone arrays with human tracking technology. We also proposed a method for estimating the face orientation of the detected speakers. The system was evaluated in two steps: individual utterances at different positions and orientations, and simultaneous dialogues by multiple speakers. The evaluation revealed that the proposed system detected speech activity intervals with an accuracy of more than 90%, and estimated face orientations with standard deviations within 30 degrees, except in cases where all arrays were positioned opposite to the speaker's face orientation.
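The idea of combining distributed arrays with tracked positions can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each array reports an acoustic power value (e.g. a steered-response-power or MUSIC spatial-spectrum value) toward a speaker's tracked position, thresholds the maximum power for a speech-activity decision, and approximates face orientation as the power-weighted mean direction from the speaker to the arrays, exploiting the fact that speech radiates mostly forward. All function and parameter names here are hypothetical.

```python
import numpy as np

def detect_and_orient(array_positions, speaker_position, powers, threshold=1.0):
    """Sketch of speech activity detection and face-orientation estimation.

    array_positions  : (N, 2) array of microphone-array positions
    speaker_position : (2,) tracked position of the speaker (from human tracking)
    powers           : (N,) acoustic power each array measures toward the speaker
    threshold        : hypothetical activity threshold on the maximum power
    Returns (speech_active, face_angle_deg).
    """
    powers = np.asarray(powers, dtype=float)
    active = bool(powers.max() > threshold)  # speech-activity decision

    # Unit vectors pointing from the speaker toward each array.
    directions = np.asarray(array_positions, float) - np.asarray(speaker_position, float)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Speech radiates mostly forward, so the power-weighted mean direction
    # from the speaker to the arrays approximates the face orientation.
    weights = np.clip(powers, 0.0, None)
    mean_dir = (weights[:, None] * directions).sum(axis=0)
    face_angle_deg = float(np.degrees(np.arctan2(mean_dir[1], mean_dir[0])))
    return active, face_angle_deg
```

An array located behind the speaker receives little power and thus contributes little weight, which mirrors the abstract's caveat: when all arrays are opposite to the face orientation, the weights carry almost no directional information and the estimate degrades.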
