Real-time audio-visual localization of user using microphone array and vision camera

In home environments, demand is growing for robots that serve users by, for example, cleaning rooms or fetching objects. To accomplish such tasks, it is essential to develop a natural means of human-robot interaction (HRI). One of the most natural is for the robot to recognize the user's call, localize the user's position, and then approach the user to carry out the task. User localization is therefore a key technology. In this paper, we propose a novel audio-visual user localization system consisting of a microphone array with eight sensors and a video camera. The calling direction is estimated by spectral subtraction of the spatial spectra. In particular, a novel beamforming method is proposed to suppress the nonstationary audio noise that invariably occurs in real-world environments. Furthermore, a robust face detection method based on an AdaBoost classifier is proposed to double-check the user. A new postprocessing step applied to the face candidates substantially reduces false alarms. Successful results in a real home environment demonstrate the system's efficacy and feasibility. Implementation issues, limitations, and possible solutions are also discussed.
