Cross‐modal Learning of Visual Categories using Different Levels of Supervision

Today's object categorization methods rely on either supervised or unsupervised training. While supervised methods tend to produce more accurate results, unsupervised methods are highly attractive because they can exploit far larger amounts of unlabeled training data. This paper proposes a novel method that uses unsupervised training to obtain visual groupings of objects and a cross-modal learning scheme to overcome inherent limitations of purely unsupervised training. The method uses a unified, scale-invariant object representation that allows labeled and unlabeled information to be handled in a coherent way. One potential setting is to learn object category models from many unlabeled observations and a few dialogue interactions, which can be ambiguous or even erroneous. First experiments demonstrate that the system can learn meaningful generalizations across objects from only a few dialogue interactions.
