Human Shape and Motion from Video

In recent years, as cameras have become inexpensive and ever more prevalent, there has been increasing interest in modeling human shape and motion from image data. This type of modeling has many applications, such as electronic publishing, entertainment, sports medicine and athletic training. It is, however, an inherently difficult task, both because the body is very complex and because the data that can be extracted from images is often incomplete, noisy and ambiguous. EPFL's Computer Vision Laboratory seeks to overcome these difficulties by using facial and body animation models, not only to represent the data, but also to guide the fitting process, thereby substantially improving performance. Starting from sophisticated 3-D animation models, we reformulate them so that they can be used for data analysis in the following three research areas.

1 Augmented reality and 3-D tracking

In augmented reality applications, tracking and registration of cameras and objects are required: to combine real and rendered scenes, we must project synthetic models at the right location in real images. As shown in Fig. 1, we have developed robust real-time methods for 3-D tracking of rigid objects and human faces [9, 10]. We formulate the tracking problem in terms of local bundle adjustment and merge the information from preceding frames with that provided by a very limited number of keyframes created during a training stage. The result is a real-time tracker that neither jitters nor drifts and can handle significant aspect changes (a minimal sketch of this pose-refinement step appears below).

We have also developed a fast 3-D object detection and pose estimation method [4, 5] that can initialize or reinitialize the tracker in real time. It relies on matching keypoints but, in contrast with previous methods that use ad hoc local descriptors or estimate local affine deformations, it treats the wide-baseline matching of these keypoints as a classification problem, in which each class corresponds to the set of all possible views of one keypoint. We synthesize a large number of views of the individual keypoints of the object and train a classifier to recognize them. At run time, we rely on this classifier to decide to which class, if any, an observed feature belongs. This formulation allows us to use powerful and fast classification methods to reduce matching error rates.
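To make the keyframe-based tracking formulation concrete, the following Python sketch illustrates one plausible form of the pose-refinement step: the current camera pose is obtained by minimizing reprojection error over correspondences drawn both from the previous frame and from an offline keyframe. This is our own illustration, not the laboratory's code; the pinhole model, the weighting scheme and all function names are assumptions.

```python
# Sketch of keyframe-anchored pose refinement (illustrative, not the
# published implementation). Pose is parameterized as a 6-vector:
# a rotation vector (Rodrigues) followed by a translation.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(pose, pts3d, K):
    """Project 3-D model points with pose = (rotvec[3], t[3])."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    cam = pts3d @ R.T + pose[3:]          # points in the camera frame
    uv = cam[:, :2] / cam[:, 2:3]         # perspective division
    return uv @ K[:2, :2].T + K[:2, 2]    # apply intrinsics (no skew)

def residuals(pose, pts3d_kf, uv_kf, pts3d_prev, uv_prev, K, w_kf=1.0):
    """Stack reprojection errors from keyframe and previous-frame matches.
    The keyframe term anchors the pose, which is what suppresses drift;
    the previous-frame term keeps the estimate temporally smooth."""
    r_kf = (project(pose, pts3d_kf, K) - uv_kf).ravel() * w_kf
    r_prev = (project(pose, pts3d_prev, K) - uv_prev).ravel()
    return np.concatenate([r_kf, r_prev])

def track_frame(pose_init, pts3d_kf, uv_kf, pts3d_prev, uv_prev, K):
    # A robust (Huber) loss limits the influence of bad matches.
    sol = least_squares(residuals, pose_init, loss='huber', f_scale=2.0,
                        args=(pts3d_kf, uv_kf, pts3d_prev, uv_prev, K))
    return sol.x
```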
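The classification view of wide-baseline matching can likewise be sketched in a few lines. Under the assumption, for illustration only, that each model keypoint defines one class and that training views are synthesized by random rotations and scalings of the patch around it, a generic off-the-shelf classifier stands in for the fast classifier of [4, 5]; `model_image`, `keypoints` and all other names here are hypothetical.

```python
# Sketch of wide-baseline matching cast as classification (illustrative).
# One class per model keypoint; training data are synthesized views.
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier

PATCH = 32  # side length of the patch extracted around each keypoint

def random_affine_views(image, kp, n_views=100):
    """Synthesize views of one keypoint under random rotation and scale."""
    views = []
    for _ in range(n_views):
        angle = np.random.uniform(-180, 180)
        scale = np.random.uniform(0.6, 1.5)
        M = cv2.getRotationMatrix2D(tuple(kp), angle, scale)
        warped = cv2.warpAffine(image, M, image.shape[1::-1])
        x, y = np.int32(kp)
        patch = warped[y - PATCH//2:y + PATCH//2, x - PATCH//2:x + PATCH//2]
        if patch.shape == (PATCH, PATCH):  # skip patches cut by the border
            views.append(patch.ravel())
    return views

def train_keypoint_classifier(model_image, keypoints):
    """Train one classifier whose classes are the model keypoints."""
    X, y = [], []
    for label, kp in enumerate(keypoints):
        views = random_affine_views(model_image, kp)
        X.extend(views)
        y.extend([label] * len(views))
    clf = RandomForestClassifier(n_estimators=50)
    clf.fit(np.array(X), np.array(y))
    return clf
```

At run time, the patch around each detected feature would be classified the same way, and predictions below a confidence threshold rejected as belonging to no class, mirroring the "if any" test in the text above.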