The Articulated Scene Model: Model-less Priors for Robot Object Learning

Human analysis of dynamic scenes consists of two parallel processing chains [2]. The first one concentrates on motion, which is defined as variation of location, while the second one processes change, which is the variation of structure. The detection of a scene change is realized phenomenologically by comparing currently visible structures with a representation in memory. These psychological findings have motivated us to design an articulated scene modeling approach [1] which enables a robot to extract articulated scene parts by observing the spatial changes caused by their manipulation.

This approach processes a sequence of 3D scans taken from a fixed view point, capturing a dynamic scene in which a human moves around and manipulates the environment by, e.g., replacing chairs or opening doors. It estimates per frame F_t the actively moving persons E_t, the so far static scene background S_t, and the movable objects / articulated scene parts O_t. The moving persons are tracked using a particle filter with a weak cylinder model (see the first sketch below). Static and movable scene parts are computed by comparing the current frame, from which the tracked persons have been excluded, with the background model S_{t-1} estimated from the previous frames. For dense depth sensors, like the SwissRanger or the Kinect camera, such a comparison can be implemented as a pixel-wise subtraction of S_{t-1} from F_t. Using the physical fact that, per pixel, the farthest static depth measurement along a ray defines the static background, the background model is instantaneously adapted to newly uncovered background, while arbitrary movable objects (like a replaced chair or an opened cupboard door) arise model-less from depth measurements emerging in front of the known static background (see the second sketch below).

The video shows, for a SwissRanger sequence, the emergence of the static background (in blue), the movable objects (in orange), and the trajectories of a tracked entity (in cyan and green) from two view points. The scene modeling part of our approach can also be demonstrated on site in real time on Kinect data.

The development of cameras like the Kinect, which combine dense depth measurements with a regular color camera in an elegant way, opens up new possibilities for interactive object learning. Future work could concentrate on the question whether the extracted movable objects (like a chair) can be used to compute suitable features for detecting, for example, other chairs in the scene which have not been moved so far. Further, the position history of an articulated object like a drawer provides model-less tracks of object parts which can be used to train candidate kinematic models (rotational, rigid, or prismatic) for the observed tracks [3]; see the third sketch below.
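To make the tracking step concrete, the following is a minimal sketch of one plausible reading of a particle filter with a weak cylinder model: the state is a person's position on the ground plane, and a particle is supported by every scan point that falls inside an assumed cylinder footprint. All identifiers and parameters (track_step, N_PARTICLES, RADIUS, MOTION_NOISE) are illustrative assumptions, not the implementation used in [1].

```python
# Minimal particle-filter sketch: track a person as a vertical cylinder
# on the ground plane. Parameters are illustrative assumptions.
import numpy as np

N_PARTICLES = 200    # assumed particle count
RADIUS = 0.3         # assumed cylinder radius in metres
MOTION_NOISE = 0.05  # assumed per-frame position diffusion in metres

def init_particles(start_xy):
    """Spawn particles around an initial person detection (x, y)."""
    return start_xy + np.random.normal(0.0, RADIUS, (N_PARTICLES, 2))

def track_step(particles, points_xy):
    """One predict/weight/resample cycle.

    particles : (N, 2) hypothesized person positions on the ground plane
    points_xy : (M, 2) subsampled ground-plane projection of the current scan
    """
    # Predict: a weak constant-position motion model, i.e. pure diffusion.
    particles = particles + np.random.normal(0.0, MOTION_NOISE, particles.shape)

    # Weight: count scan points inside each particle's cylinder footprint
    # (the "weak cylinder model").
    dists = np.linalg.norm(points_xy[None, :, :] - particles[:, None, :], axis=2)
    weights = (dists < RADIUS).sum(axis=1).astype(float) + 1e-9
    weights /= weights.sum()

    # Resample proportionally to the weights.
    particles = particles[np.random.choice(N_PARTICLES, N_PARTICLES, p=weights)]
    return particles, particles.mean(axis=0)  # particles and position estimate
```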
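The per-pixel background update lends itself to a compact formulation. The sketch below assumes dense depth frames given as 2D numpy arrays in metres, with tracked persons already masked out; the function name and the 5 cm threshold are illustrative choices, not values from [1].

```python
# Minimal per-pixel sketch of the background update and movable-object
# segmentation described above; threshold is an illustrative assumption.
import numpy as np

THRESHOLD = 0.05  # assumed tolerance in metres for "in front of background"

def update_scene_model(background, frame, person_mask):
    """Update the static background S_t and segment movable objects O_t.

    background  : (H, W) background depth S_{t-1} (np.nan = still unknown)
    frame       : (H, W) current depth frame F_t (np.nan = no measurement)
    person_mask : (H, W) bool, True where a tracked person was detected
    """
    valid = ~np.isnan(frame) & ~person_mask

    # Farther measurements along a ray reveal newly uncovered background:
    # per pixel, the farthest static depth ever observed defines S_t.
    farther = valid & (np.isnan(background) | (frame > background))
    background = np.where(farther, frame, background)

    # Measurements emerging in front of the known static background
    # belong to movable objects / articulated scene parts.
    movable = valid & (frame < background - THRESHOLD)
    return background, movable
```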
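Finally, the kinematic-model training mentioned as future work can be illustrated by simple model selection over a 2D track of part positions. The fitting methods below (mean point for rigid, PCA line fit for prismatic, algebraic circle fit for rotational) and the BIC-style score are illustrative assumptions, not the method of [3].

```python
# Minimal sketch: select a kinematic model (rigid, prismatic, rotational)
# for a model-less (N, 2) track of object-part positions, cf. [3].
import numpy as np

def fit_rigid(track):
    # Rigid connection: the part does not move; residual around the mean.
    res = track - track.mean(axis=0)
    return np.sqrt((res ** 2).sum(axis=1).mean()), 2  # (rms, #parameters)

def fit_prismatic(track):
    # Prismatic joint: motion along a line (PCA fit of dominant direction).
    centered = track - track.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    res = centered - (centered @ vt[0])[:, None] * vt[0][None, :]
    return np.sqrt((res ** 2).sum(axis=1).mean()), 3

def fit_rotational(track):
    # Rotational joint: motion on a circular arc (algebraic circle fit).
    x, y = track[:, 0], track[:, 1]
    A = np.column_stack([x, y, np.ones(len(track))])
    c, *_ = np.linalg.lstsq(A, x ** 2 + y ** 2, rcond=None)
    center = c[:2] / 2.0
    radius = np.sqrt(c[2] + center @ center)
    res = np.linalg.norm(track - center, axis=1) - radius
    return np.sqrt((res ** 2).mean()), 3

def select_kinematic_model(track):
    """Return the candidate model with the lowest BIC-style score."""
    n = len(track)
    scores = {}
    for name, fit in [("rigid", fit_rigid),
                      ("prismatic", fit_prismatic),
                      ("rotational", fit_rotational)]:
        rms, k = fit(track)
        scores[name] = n * np.log(rms ** 2 + 1e-12) + k * np.log(n)
    return min(scores, key=scores.get)
```

On a track extracted from an opened drawer the prismatic model should obtain the lowest score, on a cupboard door the rotational one, and on a part that never moves relative to its base the rigid one.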