Tracking Body Parts of Multiple People for Multi-person Multimodal Interface

Although large displays could allow several users to work together and move freely in a room, their associated interfaces are limited to contact devices that must generally be shared. This paper describes a novel interface called SHIVA (Several-Humans Interface with Vision and Audio) that allows several users to interact remotely with a very large display using both speech and gesture. The head and both hands of two users are tracked in real time by a stereo-vision-based system. From the body part positions, the direction pointed at by each user is computed, and selection gestures performed with the second hand are recognized. The pointing gesture is fused with the n-best results from speech recognition, taking the application context into account. The system is tested on a chess game played by two users on a very large display.
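To illustrate the pointing-direction step, the sketch below intersects a head-to-hand ray with the display plane. It assumes the pointed direction is approximated by the line from the head through the pointing hand, a common choice in pointing-gesture systems; the function name, coordinate frame, and this head-hand approximation are illustrative assumptions, not necessarily the exact method used in the paper.

```python
import numpy as np

def pointing_target_on_display(head, hand, display_origin, display_normal):
    """Return the 3-D point where the head-to-hand ray meets the display plane.

    head, hand: 3-D positions of the user's head and pointing hand (NumPy arrays).
    display_origin, display_normal: a point on the display plane and its normal.
    Returns None if the ray is parallel to the plane or points away from it.
    """
    direction = hand - head
    denom = np.dot(display_normal, direction)
    if abs(denom) < 1e-6:
        return None  # ray parallel to the display plane
    t = np.dot(display_normal, display_origin - head) / denom
    if t < 0:
        return None  # user is pointing away from the display
    return head + t * direction

# Example: display plane z = 0, user standing about 2 m in front of it.
head = np.array([0.0, 1.7, 2.0])
hand = np.array([0.3, 1.4, 1.5])
target = pointing_target_on_display(
    head, hand,
    display_origin=np.array([0.0, 0.0, 0.0]),
    display_normal=np.array([0.0, 0.0, 1.0]),
)
print(target)  # 3-D point on the display that the user is pointing at
```

In practice, the 2-D display coordinates recovered this way would then be combined with the speech recognizer's n-best hypotheses and the application context to resolve the intended command.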
