Robot developmental learning of an object ontology grounded in sensorimotor experience

How can a robot learn to conceptualize its environment in terms of objects and actions, starting from its intrinsic "pixel-level" sensorimotor interface? Several domains in artificial intelligence (including language, planning, and logic) rely on the existence of a symbolic representation that provides objects, relations, and actions. With real robots it is difficult to ground these high-level symbolic representations, because hand-written object models and control routines are often brittle and fail to account for the complexities of the real world. In contrast, developmental psychologists describe how an infant's naive understanding of objects transforms with experience into an adult's more sophisticated understanding. Can a robot's understanding of objects develop similarly? This thesis describes a learning process that leads to a simple and useful theory of objects, their properties, and the actions that apply to them. The robot's initial "pixel-level" experience consists of a range-sensor image stream and a background model of its immediate surroundings. The background model is an occupancy grid that explains away most of the range-sensor data using a static world assumption. To this developing robot, an "object" is a theoretical construct abduced to explain a subset of the robot's sensorimotor experience that is not explained by the background model. This approach leads to the Object Perception and Action Learner (OPAL). OPAL starts with a simple theory of objects that is used to bootstrap more sophisticated capabilities. In the initial theory, the sensor returns explained by an object have spatial and temporal proximity. This allows the robot to individuate, track, describe, and classify objects (such as a chair or wastebasket) in a simple scene without complex prior knowledge. The initial theory is used to learn a more sophisticated theory. 
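The individuation step described above can be illustrated with a minimal sketch. It is not OPAL's implementation: the function names, the set-of-cells grid representation, and the `cell_size` and `max_gap` values are assumptions chosen for illustration. It covers only spatial proximity within a single scan; tracking across frames (temporal proximity) is left out.

```python
from math import floor, hypot


def unexplained_points(points, occupied_cells, cell_size=1.0):
    """Drop range-sensor endpoints that land in cells the background grid marks occupied."""
    return [
        (x, y)
        for x, y in points
        if (floor(x / cell_size), floor(y / cell_size)) not in occupied_cells
    ]


def cluster_by_proximity(points, max_gap=0.3):
    """Group points so that each one lies within max_gap of another point in its cluster."""
    unvisited = list(points)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            px, py = frontier.pop()
            near = [q for q in unvisited if hypot(q[0] - px, q[1] - py) <= max_gap]
            unvisited = [q for q in unvisited if q not in near]
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(cluster)
    return clusters


# Hypothetical scan: three returns on a mapped wall, plus two unmapped groups of returns.
wall_cells = {(0, 0), (1, 0), (2, 0)}
scan = [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5),
        (1.0, 3.0), (1.1, 3.0), (1.2, 3.1),
        (5.0, 5.0), (5.1, 5.0)]
candidates = cluster_by_proximity(unexplained_points(scan, wall_cells))
print(sorted(len(c) for c in candidates))
```

Each resulting cluster is a candidate object: a group of returns that the static background model does not explain and that lie close together in space.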
First, the robot uses the perceptual representations described above to create structurally consistent object models that support object localization and recognition. Second, the robot learns actions that support planning to achieve object-based goals. The combined system extends the robot's representational capabilities to include objects and both constant and time-varying properties of objects. The robot can use constant properties such as shape to recognize objects it has previously observed. It can also use time-varying properties such as location or orientation to track objects that move. These properties can be used to represent the learned preconditions and postconditions of actions. Thus, the robot can make and execute plans to achieve object-based goals, using the pre- and postconditions to infer the ordering constraints among actions in the plan. The learning process and the learned representations were evaluated with metrics that support verification by both the robot and the experimenter. The robot learned object shape models that are structurally consistent to within the robot's sensor precision. The learned shape models also support accurate object classification with externally provided labels. The robot achieved goals specified in terms of object properties by planning with the learned actions, solving tasks such as facing an object, approaching an object, and moving an object to a target location. The robot completed these tasks both reliably and accurately.
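How pre- and postconditions can determine the order of actions in a plan is sketched below under strong simplifying assumptions. The `Action` class, the `plan` function, and the symbolic property names (`facing`, `near`, `at_target`) are hypothetical; the thesis's learned actions are defined over object properties such as location and orientation, not over these hand-written labels. The sketch also ignores effects that undo earlier conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset   # conditions that must hold before the action runs
    post: frozenset  # conditions that hold after the action runs


def plan(goal, state, actions, depth=0):
    """Backward-chain from a goal condition; return an ordered action list or None."""
    if goal in state:
        return []
    if depth > len(actions):  # guard against cyclic preconditions
        return None
    for act in actions:
        if goal not in act.post:
            continue
        steps, known = [], set(state)
        for cond in sorted(act.pre):
            sub = plan(cond, known, actions, depth + 1)
            if sub is None:
                break  # this action's preconditions cannot be met
            steps += sub
            for s in sub:
                known |= s.post
        else:
            # All preconditions are achievable, so this action comes after them.
            return steps + [act]
    return None


# Hypothetical actions loosely named after the tasks in the evaluation.
ACTIONS = [
    Action("face", frozenset(), frozenset({"facing"})),
    Action("approach", frozenset({"facing"}), frozenset({"near"})),
    Action("push", frozenset({"near"}), frozenset({"at_target"})),
]

print([a.name for a in plan("at_target", set(), ACTIONS)])
```

The ordering falls out of the conditions alone: an action whose postcondition satisfies another action's precondition is scheduled before it.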
