Detecting and localizing human figures

In this thesis we address the problem of detecting and localizing human figures in still images. The goal is to develop algorithms that take as input a single image containing a human body and locate its joint positions. The problem is difficult but important: the ability to accurately find people in still images would facilitate applications such as initializing 3D kinematic trackers, understanding human actions, and re-rendering for graphics. We have developed two complementary approaches to this problem.

The first is an exemplar-based method, presented in Chapter 3. The basic approach is to store a number of 2D views of the human body in a variety of configurations and viewpoints with respect to the camera. On each of these stored views, the locations of the body joints (left elbow, right knee, etc.) are manually marked and labeled for future use. These labeled images are the exemplars. The input image is then matched to each stored view. Assuming that there is a stored view sufficiently similar in configuration and pose, the locations of the body joints can be transferred from the exemplar view to the input image in order to localize the human figure. The matching to exemplars is performed with shape matching, and we present preliminaries on the details of this matching in Chapter 2. In particular, we demonstrate that a recently introduced shape descriptor, the “shape context”, can be used to quickly prune a search for similar shapes. This ability to quickly sift through a large collection of stored shapes is important for exemplar-based approaches that require a large number of stored views of the human body.

The second approach for localizing human figures, developed in Chapter 4, is a parts-based method.
Instead of storing a set of full-body configurations, we model the human body as a collection of parts corresponding to “half-limbs” (upper and lower arms and legs) and a torso. Given an input image, we attempt to detect these parts and then assemble them into a human figure.

As mentioned above, in a general setting the problem we are tackling is extremely difficult. By no means do we claim to have solved it in this thesis; much work remains to be done. However, in a restricted domain the problem becomes more tractable. In Chapter 5, we develop a novel graphics application called motion synthesis, which uses the human body joint positions obtained via our methods to produce synthetic videos of human figures in motion. For this application only a single human figure, moving through a relatively limited set of body configurations, need be localized. This limited domain is amenable to success using the exemplar-based method.
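
The exemplar-based pipeline described above (prune the stored views with a shape descriptor, then transfer the labeled joints from the best match) can be illustrated with a minimal Python sketch. This is not the implementation developed in the thesis. It assumes each exemplar is a set of 2D edge sample points plus labeled joint positions, uses a simplified log-polar histogram in the spirit of the shape context, and replaces the deformable matching step with a plain translation-and-scale alignment. All function names, the `points`/`joints` dictionary keys, and the bin parameters are hypothetical.

```python
import numpy as np

def shape_context(points, index, n_r=5, n_theta=12):
    """Simplified shape-context-style descriptor: a normalized log-polar
    histogram of the other points, relative to points[index]."""
    pts = np.asarray(points, dtype=float)
    diff = np.delete(pts, index, axis=0) - pts[index]
    r = np.hypot(diff[:, 0], diff[:, 1])
    r = r / r.mean()  # assumption: normalize by mean distance for scale invariance
    theta = np.arctan2(diff[:, 1], diff[:, 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    r_bin = np.clip(np.digitize(r, r_edges) - 1, 0, n_r - 1)
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1.0)
    return hist.ravel() / hist.sum()

def chi2_cost(h1, h2):
    """Chi-squared distance between two normalized histograms."""
    denom = h1 + h2
    mask = denom > 0
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])

def match_cost(query_pts, exemplar_pts, rep_indices):
    """Sum, over a few query points, of the best descriptor match in the exemplar."""
    ex_desc = [shape_context(exemplar_pts, j) for j in range(len(exemplar_pts))]
    total = 0.0
    for i in rep_indices:
        q = shape_context(query_pts, i)
        total += min(chi2_cost(q, e) for e in ex_desc)
    return total

def transfer_joints(query_pts, exemplar_pts, exemplar_joints):
    """Map exemplar joints into the query frame using translation and scale only
    (a stand-in for a proper deformable alignment)."""
    q = np.asarray(query_pts, dtype=float)
    e = np.asarray(exemplar_pts, dtype=float)
    q_c, e_c = q.mean(axis=0), e.mean(axis=0)
    scale = np.linalg.norm(q - q_c) / np.linalg.norm(e - e_c)
    return (np.asarray(exemplar_joints, dtype=float) - e_c) * scale + q_c

def localize(query_pts, exemplars, n_rep=4, seed=0):
    """Pick the lowest-cost exemplar and transfer its joints to the query."""
    rng = np.random.default_rng(seed)
    rep = rng.choice(len(query_pts), size=n_rep, replace=False)
    costs = [match_cost(query_pts, ex["points"], rep) for ex in exemplars]
    best = int(np.argmin(costs))
    joints = transfer_joints(query_pts, exemplars[best]["points"],
                             exemplars[best]["joints"])
    return best, joints
```

Comparing only a handful of query descriptors against each stored view keeps the cost of scanning a large exemplar collection low, which is the role the abstract assigns to the pruning step; a full system would follow it with a more careful match on the surviving candidates.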