Real-time gesture recognition using deterministic boosting

A gesture recognition system is described which can reliably recognize single-hand gestures in real time on a 600 MHz notebook computer. The system has a vocabulary of 46 gestures, including the American Sign Language fingerspelling alphabet and digits. It also supports mouse actions such as drag and drop, and is demonstrated controlling a windowed operating system, editing a document, and performing file-system operations with extremely low error rates over long time periods. Real-time performance is provided by a novel combination of exemplar-based classification and a new "deterministic boosting" algorithm which allows fast online retraining. Importantly, each frame of video is processed independently: no temporal Markov model is used to constrain gesture identity, and the search region is the entire image. This places stringent requirements on the accuracy and speed of recognition, which are met by the proposed architecture.

Gesture recognition is an area of active research in computer vision. The prospect of a user interface in which natural gestures enhance human-computer interaction brings visions of more accessible computer systems, and ultimately of higher-bandwidth interactions than are possible using keyboard and mouse alone. This paper describes a system for automatic real-time control of a windowed operating system based entirely on one-handed gesture recognition. Using a computer-mounted video camera as the sensor, the system reliably interprets a 46-element gesture set at 15 frames per second on a 600 MHz notebook PC. The gesture set, shown in figure 1, comprises the 36 letters and digits of the American Sign Language fingerspelling alphabet [9], three 'mouse buttons', and some ancillary gestures. The accompanying video, and figure 4, show a transcript of about five minutes of system operation in which files are created, renamed, moved, and edited, entirely under gesture control.
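The per-frame, whole-image recognition described above can be sketched as a nearest-exemplar classifier. This is a plain nearest-neighbour sketch, not the paper's boosted classifier, and the feature dimension, exemplar counts, and synthetic data below are illustrative assumptions:

```python
import numpy as np

def classify_frame(features, exemplars, labels):
    # Per-frame classification: each call is independent, mirroring the
    # claim above that no temporal Markov model links successive frames.
    dists = np.linalg.norm(exemplars - features, axis=1)
    best = int(np.argmin(dists))
    return labels[best], dists[best]

# Illustrative use with synthetic data (feature size and exemplar counts
# are assumptions, not the paper's actual representation):
rng = np.random.default_rng(0)
exemplars = rng.normal(size=(46 * 10, 64))   # 10 exemplars per gesture
labels = np.repeat(np.arange(46), 10)        # 46-gesture vocabulary
frame_features = exemplars[123] + 0.01 * rng.normal(size=64)
gesture, dist = classify_frame(frame_features, exemplars, labels)
```

Because every frame is classified from scratch, a misrecognized frame cannot corrupt later frames, which is what makes the per-image accuracy requirement so stringent.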
This corresponds to a per-image recognition rate of over 99%, which exceeds any reported system to date, whether or not real-time. This significant improvement in performance is the outcome of three factors:

1. The general engineering of the system means that preprocessing is reliable on every frame. Lighting is controlled with just enough care to ensure that most skin pixels are detected using simple image processing. The user wears a coloured wrist band which allows the orientation of the hand to be easily computed.

2. An exemplar-based classifier [11, 19] ensures that recognition of the gesture label from a preprocessed image uses a rich, informative model, allowing a large gesture vocabulary to be employed.
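The preprocessing in factor 1 can be sketched as simple per-pixel colour thresholding, with the wrist band's centroid fixing the hand's orientation. All threshold values, the blue band colour, and the centroid-based angle estimate are assumptions for illustration, not values from the paper:

```python
import numpy as np

def hand_orientation(frame_rgb, skin_lo, skin_hi, band_lo, band_hi):
    # Per-pixel colour thresholds give a skin mask and a wrist-band mask;
    # the vector from the band centroid to the skin centroid gives the
    # orientation of the hand in image coordinates.
    skin = np.all((frame_rgb >= skin_lo) & (frame_rgb <= skin_hi), axis=-1)
    band = np.all((frame_rgb >= band_lo) & (frame_rgb <= band_hi), axis=-1)
    skin_c = np.argwhere(skin).mean(axis=0)   # (row, col) centroid
    band_c = np.argwhere(band).mean(axis=0)
    dy, dx = skin_c - band_c
    return np.degrees(np.arctan2(dy, dx))

# Synthetic frame: a "hand" blob above a "wrist band" blob.
frame = np.zeros((100, 100, 3), dtype=np.uint8)
frame[10:30, 40:60] = (200, 150, 120)   # skin-coloured region
frame[60:80, 40:60] = (0, 0, 200)       # blue wrist band
angle = hand_orientation(frame,
                         skin_lo=(180, 130, 100), skin_hi=(255, 180, 150),
                         band_lo=(0, 0, 180), band_hi=(50, 50, 255))
# Rows grow downward, so a hand pointing "up" gives an angle of -90 degrees.
```

Normalizing each frame by this orientation before classification is what lets a simple exemplar comparison cope with rotated hands.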

[1] Alex Pentland et al. Coupled hidden Markov models for complex action recognition. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997.

[2] Thomas B. Moeslund et al. Real-time recognition of hand alphabet gestures using principal component analysis. 1997.

[3] David C. Hogg et al. Towards 3D hand tracking using a deformable model. Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 1996.

[4] Mansoor Sarhadi et al. Non-linear statistical models for the 3D reconstruction of human pose and motion from monocular image sequences. Image Vis. Comput., 2000.

[5] Martin L. A. Sternberg. American Sign Language Dictionary. 1981.

[6] Alex Pentland et al. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Trans. Pattern Anal. Mach. Intell., 1998.

[7] Dariu Gavrila et al. Real-time object detection for "smart" vehicles. Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.

[8] Timothy F. Cootes et al. Active Shape Models: Their Training and Application. Comput. Vis. Image Underst., 1995.

[9] Mansoor Sarhadi et al. Building Temporal Models for Gesture Recognition. BMVC, 2000.

[10] Dimitris N. Metaxas et al. Parallel hidden Markov models for American sign language recognition. Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.

[11] Timothy F. Cootes et al. Tracking and Recognising Hand Gestures using Statistical Shape Models. BMVC, 1995.

[12] Andrew Blake et al. Probabilistic tracking in a metric space. Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), 2001.

[13] Jérôme Martin et al. An Appearance-Based Approach to Gesture Recognition. ICIAP, 1997.

[14] Paulo R. S. Mendonça et al. Model-based 3D tracking of an articulated hand. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.

[15] Shaogang Gong et al. A Multi-View Nonlinear Active Shape Model Using Kernel PCA. BMVC, 1999.

[16] Takeo Kanade et al. DigitEyes: Vision-Based Human Hand Tracking. 1993.

[17] Paul A. Viola et al. Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.

[18] Vladimir Pavlovic et al. Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Trans. Pattern Anal. Mach. Intell., 1997.

[19] Yoav Freund et al. Experiments with a New Boosting Algorithm. ICML, 1996.