Still looking at people

There is a great need for programs that can describe what people are doing from video. Among other applications, such programs could be used to search consumer video for scenes; to support surveillance; to support the design of buildings and of public places; to screen people for diseases; and to build enhanced human-computer interfaces. Building such programs is difficult, because it is hard to identify and track people in video sequences, because we have no canonical vocabulary for describing what people are doing, and because phenomena such as aspect and individual variation greatly affect the appearance of what people are doing. Recent work in kinematic tracking has produced methods that can report the kinematic configuration of the body automatically, and with moderate accuracy. While it is possible to build methods that use kinematic tracks to reason about the 3D configuration of the body, and from this about the activities, such methods remain relatively inaccurate. However, they have the attraction that one can build models that are generative, and that allow activities to be assembled from a set of distinct spatial and temporal components. The models themselves are learned from labelled motion capture data and are assembled in a way that makes it possible to learn very complex finite automata without estimating large numbers of parameters. The advantage of such a model is that one can search videos for examples of activities specified with a simple query language, without possessing any example of the activity sought. In this case, aspect is dealt with by explicit 3D reasoning. An alternative approach is to model the whole problem as k-way classification into a set of known classes. This approach is much more accurate at present, but has the difficulty that we do not really know what the classes should be in general, because we do not know how to describe activities.
Recent work in object recognition on describing unfamiliar objects suggests that activities might be described in terms of attributes -- properties that many activities share, that are easy to spot, and that are individually somewhat discriminative. Such a description would allow a useful response to an unfamiliar activity. I will sketch current progress on this agenda.
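
As a rough illustration of the attribute idea, and not of any method from the talk, the sketch below thresholds the scores of some attribute detectors and reports which attributes are present, together with a known activity name when the attribute pattern matches one exactly. The attribute names, the activity signatures, the 0.5 threshold, and the function `describe` are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical attribute vocabulary; a real one would have to be designed or learned.
ATTRIBUTES = ["upright", "legs_moving", "arms_raised", "translating"]

# Hypothetical binary attribute signatures for a few known activities,
# listed in the same order as ATTRIBUTES.
KNOWN_ACTIVITIES: Dict[str, Tuple[int, ...]] = {
    "walk": (1, 1, 0, 1),
    "wave": (1, 0, 1, 0),
    "crawl": (0, 1, 0, 1),
}


def describe(scores: Dict[str, float],
             threshold: float = 0.5) -> Tuple[List[str], Optional[str]]:
    """Return the attributes judged present and a matching known activity, if any.

    `scores` maps attribute names to detector confidences in [0, 1];
    attributes missing from `scores` are treated as absent.
    """
    present = [a for a in ATTRIBUTES if scores.get(a, 0.0) >= threshold]
    signature = tuple(int(a in present) for a in ATTRIBUTES)
    # Exact signature match only; an unfamiliar activity yields label None
    # but still receives a description in terms of its attributes.
    label = next(
        (name for name, sig in KNOWN_ACTIVITIES.items() if sig == signature),
        None,
    )
    return present, label


if __name__ == "__main__":
    familiar = {"upright": 0.9, "legs_moving": 0.8,
                "arms_raised": 0.1, "translating": 0.7}
    unfamiliar = {"upright": 0.9, "legs_moving": 0.8,
                  "arms_raised": 0.7, "translating": 0.9}
    print(describe(familiar))
    print(describe(unfamiliar))
```

The point of the sketch is the second call: no known class matches, yet the list of present attributes still gives a useful response to the unfamiliar activity, which a plain k-way classifier cannot do.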