Tutor-based learning of visual categories using different levels of supervision

Recent years have seen strong work in visual recognition, dialogue interpretation, and multi-modal learning aimed at providing the building blocks that enable intelligent robots to interact with humans in a meaningful way and even to evolve continuously during this process. Building systems that unify these components under a common architecture has turned out to be challenging, as each approach comes with its own set of assumptions, restrictions, and implications. For example, recent progress in visual category recognition has had limited impact on interactive systems, for diverse reasons. We identify and address two major challenges in integrating modern visual categorization techniques into an interactive learning system: reducing the number of labeled training examples required and dealing with potentially erroneous input. Today's object categorization methods use either supervised or unsupervised training. While supervised methods tend to produce more accurate results, unsupervised methods are highly attractive because they can exploit far larger amounts of unlabeled training data. We propose a novel method that uses unsupervised training to obtain visual groupings of objects and a cross-modal learning scheme to overcome the inherent limitations of purely unsupervised training. The method uses a unified, scale-invariant object representation that allows labeled and unlabeled information to be handled in a coherent way. First experiments demonstrate the ability of the system to learn object category models from many unlabeled observations and a few dialogue interactions that can be ambiguous or even erroneous.
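The general idea of combining unsupervised grouping with a handful of possibly erroneous tutor labels can be illustrated with a toy sketch. This is not the method of the paper: it assumes plain 2-D feature vectors, uses k-means as a stand-in for the unsupervised grouping step, and uses a per-cluster majority vote as a stand-in for the cross-modal learning scheme. All function names (`farthest_point_init`, `kmeans`, `name_clusters`, `predict`) are hypothetical.

```python
import numpy as np
from collections import Counter

def farthest_point_init(X, k):
    # Deterministic initialisation: start at X[0], then repeatedly pick
    # the point farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def kmeans(X, k, iters=20):
    # Unsupervised step (assumed stand-in): group unlabeled observations.
    centers = farthest_point_init(X, k)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def name_clusters(assign, tutor_labels):
    # Tutor step (assumed stand-in): each tutor label votes for the cluster
    # its observation fell into; the majority label names the cluster, so
    # an occasional wrong label is outvoted.
    votes = {}
    for idx, label in tutor_labels.items():
        votes.setdefault(int(assign[idx]), Counter())[label] += 1
    return {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}

def predict(assign, cluster_names):
    # Clusters that received no tutor label stay unnamed (None).
    return [cluster_names.get(int(c)) for c in assign]

# Eight unlabeled observations forming two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
# Five tutor labels; the one for observation 2 is deliberately wrong.
tutor_labels = {0: "cup", 1: "cup", 2: "ball", 4: "ball", 5: "ball"}

assign = kmeans(X, k=2)
labels = predict(assign, name_clusters(assign, tutor_labels))
print(labels)
```

In this toy setting, five labels are enough to name all eight observations, and the single wrong label is absorbed by the vote within its cluster. With real image features the grouping step would be far less clean, so the sketch only illustrates the division of labor between unlabeled data and tutor input.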
