Understanding tools: Task-oriented object modeling, learning and recognition

In this paper, we present a new framework, task-oriented modeling, learning and recognition, which aims at understanding the underlying functions, physics and causality in using objects as “tools”. Given a task, such as cracking a nut or painting a wall, we represent each object, e.g. a hammer or a brush, in a generative spatio-temporal representation consisting of four components: i) an affordance basis to be grasped by hand; ii) a functional basis to act on a target object (the nut); iii) the imagined actions with typical motion trajectories; and iv) the underlying physical concepts, e.g. force and pressure. In a learning phase, our algorithm observes only one RGB-D video, in which a rational human picks up one object (i.e. the tool) among a number of candidates to accomplish the task. From this example, our algorithm learns the essential physical concepts in the task (e.g. the forces involved in cracking nuts). In an inference phase, our algorithm is given a new set of objects (daily objects or stones) and picks the best choice available, together with the inferred affordance basis, functional basis, imagined human actions (a sequence of poses), and the expected physical quantity that the choice will produce. From this new perspective, any object can be viewed as a hammer or a shovel, and object recognition is not merely memorizing typical appearance examples for each category but reasoning about the physical mechanisms in various tasks to achieve generalization.
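As an illustration only, the following minimal Python sketch shows how the four-component representation and the physics-matching inference described above could be organized. All names here (ToolUse, physics_deviation, pick_best_tool) are hypothetical, not from the authors' code, and the distance-based scoring is a simplified stand-in for the ranking model learned from the single RGB-D demonstration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple
import math

@dataclass
class ToolUse:
    """One imagined way of using a candidate object as a tool."""
    affordance_basis: Sequence[float]   # i) where the hand grasps (3D point)
    functional_basis: Sequence[float]   # ii) where the tool acts on the target
    trajectory: List[Sequence[float]]   # iii) imagined motion (sequence of poses)
    physics: Dict[str, float]           # iv) e.g. {"force": 30.0, "pressure": 4.0}

def physics_deviation(use: ToolUse, learned: Dict[str, float]) -> float:
    """Deviation between the physical quantities this use would produce and
    those estimated from the demonstration (e.g. force when cracking a nut)."""
    return math.sqrt(sum((use.physics.get(k, 0.0) - v) ** 2
                         for k, v in learned.items()))

def pick_best_tool(candidates: Dict[str, List[ToolUse]],
                   learned: Dict[str, float]) -> Tuple[str, ToolUse]:
    """Inference phase: search over each candidate object's imagined uses and
    return the (object, use) pair whose expected physics best matches what
    was learned from the single demonstration video."""
    return min(((name, use)
                for name, uses in candidates.items()
                for use in uses),
               key=lambda pair: physics_deviation(pair[1], learned))
```

For instance, with learned = {"force": 30.0} estimated from the nut-cracking demonstration, a stone whose best imagined use produces a force near that value would be ranked above a brush, mirroring the claim that any object can serve as a hammer when its physics fit the task.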
