The Video Conference Tool Robot ViCToR

We present a robotic tool that autonomously follows a conversation to enable remote presence in video conferencing. When humans participate in a meeting via video conferencing tools, it is crucial that they can follow the conversation through both acoustic and visual input. To this end, we design and implement a video conferencing tool robot that uses binaural sound source localization as its primary cue to autonomously orient towards the currently talking speaker. To increase the robustness of the acoustic cue against noise, we supplement the sound localization with a source detection stage. We also include a simple onset detector to retain fast response times. Since we use only two microphones, we are confronted with ambiguities as to whether a source is in front of or behind the device. We resolve these ambiguities with the help of face detection and additional movements. We tailor the system to our target scenarios in experiments with a four-minute scripted conversation, in which we evaluate the influence of different system settings on the responsiveness and accuracy of the device.
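The core acoustic cue described above can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it is a generic two-microphone localizer, assuming GCC-PHAT for delay estimation and a hypothetical microphone spacing of 15 cm. It estimates the interaural time difference (ITD) between the two channels and maps it to a lateral angle. Note that the resulting angle is inherently front/back ambiguous, which is exactly the ambiguity the paper resolves with face detection and additional movements.

```python
import numpy as np

def gcc_phat(sig_l, sig_r, fs, max_tau):
    """Estimate the inter-channel delay (seconds) with GCC-PHAT.

    Positive delay means the left channel lags the right one.
    """
    n = sig_l.size + sig_r.size
    SL = np.fft.rfft(sig_l, n=n)
    SR = np.fft.rfft(sig_r, n=n)
    R = SL * np.conj(SR)
    R /= np.abs(R) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = int(fs * max_tau)     # restrict to physically possible delays
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def itd_to_azimuth(tau, mic_distance=0.15, c=343.0):
    """Map an ITD to a lateral angle in degrees (far-field assumption).

    The mapping cannot distinguish front from back: a source at
    azimuth theta and one at 180 - theta produce the same ITD.
    """
    s = np.clip(tau * c / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```

A usage example: white noise delayed by 4 samples at 16 kHz corresponds to an ITD of 250 µs, which with 15 cm spacing maps to roughly 35 degrees to the side (or its front/back mirror image).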
