Relational models for human-object interactions and object affordances

Humans have a remarkable ability to look at an image and decipher a wealth of information about the scene it depicts. In particular, humans are very good at: 1) locating the objects in a scene and determining how they are oriented, and 2) understanding the activities that the people in the scene are engaged in. In this thesis, we tackle the problem of helping a vision system achieve these two goals. Our techniques build on the observation that a person's activity, body pose, and interactions with objects in the environment are mutually dependent. The thesis proposes techniques to model these dependencies and to learn the nature of person-object interactions from training examples. While previous approaches typically treat object detection, human-body pose estimation, and activity recognition as isolated problems, this thesis proposes a framework that addresses them jointly, and we show empirically that it outperforms state-of-the-art techniques designed for each of the three tasks. A 2D pictorial structure framework is at the core of our approach; by handling appearance changes due to viewpoint, 3D pose, and occlusion in a principled fashion, we show that such models can be powerful and versatile.
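
For readers unfamiliar with pictorial structures, the following is a minimal sketch of the standard objective in the style of Felzenszwalb and Huttenlocher; the notation here is illustrative and is not reproduced from the thesis. A model is a collection of parts with image locations l_1, ..., l_n connected by edges E forming a tree, and matching the model to an image amounts to finding the configuration

    L^\ast = \operatorname*{arg\,min}_{l_1, \dots, l_n} \left( \sum_{i=1}^{n} m_i(l_i) + \sum_{(i,j) \in E} d_{ij}(l_i, l_j) \right)

where m_i(l_i) is the cost of placing part i at location l_i given its appearance in the image, and d_{ij}(l_i, l_j) penalizes deviations of connected parts from their preferred relative placement. Because E is a tree, this minimization can be carried out exactly and efficiently by dynamic programming, which is what makes such models practical for detecting articulated bodies and objects.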