Learning visual object definitions by observing human activities

Humanoid robots moving through our everyday environments need to recognize objects. Hand-crafting robust object definitions for every object they may encounter is impractical. In this work, we build on the observation that objects have distinct uses, and that humanoid robots co-existing with humans can observe humans using those objects and learn the corresponding object definitions. We present FOCUS, an object recognition algorithm for Finding Object Classifications through Use and Structure. FOCUS learns the structural properties (visual features) of an object by first knowing the object's affordance properties and then observing humans interacting with it during known activities. FOCUS combines an activity recognizer, flexible and robust across environments, that captures how an object is used, with a low-level visual feature processor. The relevant features are then associated with an object definition, which is in turn used for object recognition. The strength of the method lies in defining multiple aspects of an object model, namely structure and use, that are individually robust but insufficient to define the object, yet can do so jointly. We present the FOCUS approach in detail and demonstrate it across a variety of activities, objects, and environments, showing empirical evidence of the method's efficacy.
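The use-plus-structure idea above can be sketched in a minimal form: features extracted while a known activity is recognized are aggregated into an object definition keyed by that activity, and recognition then matches candidate features against the learned definitions. Everything here is an illustrative assumption, not the paper's implementation; the feature averaging, the feature names, and the tolerance-based matcher are all placeholders for FOCUS's actual activity recognizer and visual processor.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ObjectDefinition:
    """An object model pairing use (affordance) with structure (visual features)."""
    affordance: str                               # activity the object supports, e.g. "sitting"
    features: dict = field(default_factory=dict)  # aggregated visual features (hypothetical)

def learn_definitions(observations):
    """Associate visual features, observed while an activity is recognized,
    with an object definition for the object class affording that activity.

    `observations` is a list of (activity_label, feature_dict) pairs, standing in
    for the joint output of an activity recognizer and a low-level feature processor.
    """
    accumulated = defaultdict(list)
    for activity, features in observations:
        accumulated[activity].append(features)

    definitions = {}
    for activity, feature_dicts in accumulated.items():
        # Aggregate each feature by averaging over all observations of this activity
        keys = feature_dicts[0].keys()
        avg = {k: sum(d[k] for d in feature_dicts) / len(feature_dicts) for k in keys}
        definitions[activity] = ObjectDefinition(affordance=activity, features=avg)
    return definitions

def recognize(definitions, candidate_features, tolerance=0.2):
    """Return the affordances whose learned features match a candidate region."""
    matches = []
    for defn in definitions.values():
        if all(abs(defn.features[k] - candidate_features.get(k, 0.0)) <= tolerance
               for k in defn.features):
            matches.append(defn.affordance)
    return matches
```

The point of the sketch is the joint constraint: neither the activity label nor the raw features alone would identify the object, but associating them yields a definition usable for recognition in new scenes.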
