Visual tracking and recognition of the human hand

Intelligent human-computer interaction has long been an active area of research, and many vision-based hand gesture tracking and recognition approaches have been proposed. However, most of them are either confined to a fixed set of static gestures or able to track only 2D global hand motion. To recognize natural hand gestures such as those in sign language, we need to track articulated hand motion in real time. This remains an open problem because of the hand's high number of degrees of freedom, self-occlusion, variable viewpoints, and changing lighting.

This dissertation focuses on the automatic recovery of three-dimensional hand motion from one or more views. Hand tracking is formulated as Bayesian filtering: a 3D kinematic hand model generates contours, and landmark points along those contours are matched against the edge and skin-color features extracted from the input image. The parameters of the kinematic hand model thus constitute the state space, and the image features are the observations. Our tracking framework is analysis-by-synthesis; that is, we generate a set of hypotheses based on the previous tracking results and evaluate them by matching against the input images. To track efficiently, the number of hypotheses must be small; at the same time, the hypotheses must lie close to the actual hand configuration. Accomplishing this requires two components: (1) a dynamic model of the hand that predicts finger motion given the current configuration, and (2) a fast and accurate likelihood evaluation algorithm. For the first component, we propose an eigen dynamic analysis (EDA) method to model the finger dynamics; it serves as the top-down guide for generating hypotheses. For the second component, we propose a new feature, the likelihood edge, to match the landmark points on the contour with feature points in the image. To initialize automatically and to recover when tracking is lost, we propose a bottom-up posture recognition algorithm.
This algorithm collectively matches the local features in a single image against those in an image database. Through quantitative and visual experimental results, we demonstrate the effectiveness of our approach and point out its limitations.
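The analysis-by-synthesis loop described above can be sketched as a particle filter: predict a small set of hypotheses from a dynamic model, weight each hypothesis by a likelihood computed from image features, and resample. The sketch below is a toy illustration only; its Gaussian `predict` and distance-based `likelihood` are hypothetical stand-ins for the EDA dynamic model and the likelihood-edge matching, and the "hand state" is reduced to a 3-vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(particles, noise=0.05):
    """Toy dynamic model: perturb each hand-state hypothesis with
    Gaussian noise (a stand-in for EDA-based finger-motion prediction)."""
    return particles + rng.normal(0.0, noise, particles.shape)

def likelihood(particles, observation):
    """Toy likelihood: score each hypothesis by squared distance to the
    observed features (a stand-in for contour-to-image matching)."""
    d2 = np.sum((particles - observation) ** 2, axis=1)
    return np.exp(-d2 / 0.1)

def track_step(particles, observation):
    """One analysis-by-synthesis step: predict, weight, resample."""
    hypotheses = predict(particles)
    w = likelihood(hypotheses, observation)
    w /= w.sum()
    idx = rng.choice(len(hypotheses), size=len(hypotheses), p=w)
    return hypotheses[idx]

# Toy run: 200 hypotheses over a 3-DOF "hand state" converging to a target.
particles = rng.normal(0.0, 1.0, (200, 3))
target = np.array([0.5, -0.2, 0.8])
for _ in range(30):
    particles = track_step(particles, target)
est = particles.mean(axis=0)
```

The point of the sketch is the division of labor the abstract names: the dynamic model keeps the hypothesis set small and well-placed, while the likelihood function decides which hypotheses survive.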
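The bottom-up recognition step can likewise be illustrated with a minimal sketch: collect the local features of the input image and pick the database posture whose feature set matches best. The symmetric nearest-neighbour cost and the random feature vectors below are illustrative assumptions, not the dissertation's actual features or matching rule.

```python
import numpy as np

def match_cost(features_a, features_b):
    """Symmetric nearest-neighbour cost between two sets of local feature
    vectors: each feature is charged the distance to its closest match
    in the other set, averaged over both directions."""
    d = np.linalg.norm(features_a[:, None, :] - features_b[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def recognize(query_features, database):
    """Return the label of the database posture with the lowest cost."""
    costs = {label: match_cost(query_features, feats)
             for label, feats in database.items()}
    return min(costs, key=costs.get)

# Toy database: two postures, each described by ten 4-D local features.
rng = np.random.default_rng(1)
db = {
    "open": rng.normal(0.0, 1.0, (10, 4)),
    "fist": rng.normal(3.0, 1.0, (10, 4)),
}
# A query that is a noisy view of the "fist" posture.
query = db["fist"] + rng.normal(0.0, 0.1, (10, 4))
```

Because the matching is over sets of local features rather than a single global descriptor, it can tolerate a few missing or spurious features, which is what makes it usable for initialization and for recovery after tracking is lost.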