Gesture recognition using the Perseus architecture

Communication involves more than spoken information alone. Typical interactions use gestures to convey, accurately and efficiently, ideas that are more easily expressed with actions than with words. A more intuitive interface with machines should therefore involve not only speech recognition but gesture recognition as well. One of the most frequently used and expressively powerful gestures is pointing: it is far easier and more accurate to point to an object than to give a verbal description of its location. To produce a more efficient, accurate, and natural human-machine interface, we use the Perseus architecture to interpret the pointing gesture. Perseus uses a variety of techniques to reliably solve this complex visual problem in non-engineered worlds. Knowledge about the task and environment is used at all stages of processing to best interpret the scene for the current situation. Once the visual operators are chosen, contextual knowledge is used to tune them for maximal performance. Redundant interpretation of the scene provides robustness to errors in any single interpretation. Fusion of independent types of information increases tolerance when assumptions about the environment fail. Windows of attention improve speed and remove distractions from the scene. Reuse is also a major concern in the design of Perseus: information about the environment and task is represented explicitly so it can easily be reused in tasks other than pointing, and a clean interface to Perseus is provided for higher-level symbolic systems such as the RAP reactive execution system. In this paper we describe Perseus in detail and show how it is used to locate objects pointed to by people.
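
As an illustration of the fusion and window-of-attention ideas described above, the following sketch combines several independent per-pixel evidence maps and then searches only a restricted region of the image for the strongest response. It is a minimal, hypothetical example rather than Perseus itself: the particular cues (skin colour, motion, edges), the equal weighting, and the window coordinates are assumptions made purely for the demonstration.

```python
import numpy as np

def fuse_feature_maps(maps, weights=None):
    """Combine independent per-pixel evidence maps into one score map.

    Each map is a 2-D array in [0, 1] (e.g., skin-colour, motion, or
    edge evidence). Fusing several independent cues keeps the result
    usable when any single assumption about the scene fails.
    """
    maps = np.asarray(maps, dtype=float)
    if weights is None:
        weights = np.ones(len(maps)) / len(maps)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the cue axis -> one (height, width) score map.
    return np.tensordot(weights, maps, axes=1)

def locate_in_window(score_map, window):
    """Return the highest-scoring pixel inside a window of attention.

    `window` is (row0, row1, col0, col1); restricting the search both
    speeds processing and ignores distractions elsewhere in the image.
    """
    r0, r1, c0, c1 = window
    sub = score_map[r0:r1, c0:c1]
    r, c = np.unravel_index(np.argmax(sub), sub.shape)
    return r0 + r, c0 + c, sub[r, c]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w = 120, 160
    # Three illustrative, independent evidence maps sharing one hot spot.
    skin = rng.random((h, w)) * 0.3
    motion = rng.random((h, w)) * 0.3
    edges = rng.random((h, w)) * 0.3
    for cue in (skin, motion, edges):
        cue[40:50, 90:100] += 0.7   # hypothetical location of the pointing hand

    fused = fuse_feature_maps([skin, motion, edges])
    row, col, score = locate_in_window(fused, (30, 70, 80, 120))
    print(f"hand candidate at ({row}, {col}), score {score:.2f}")
```

Because each cue is computed independently, the failure of any one assumption (for example, unusual lighting defeating the colour model) only weakens the fused score rather than eliminating the target outright, which is the kind of graceful degradation the abstract attributes to fusing independent types of information.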
