LEARNING FROM IMAGES WITH CAPTIONS USING THE MAXIMUM MARGIN SET ALGORITHM

A large number of images with accompanying text captions are available on the Internet. These are valuable for training visual classifiers without any explicit manual intervention. In this paper, we present a general framework for learning from such captioned images. Under this framework, each training image is represented as a bag of regions, associated with a set of candidate labeling vectors. Each labeling vector encodes one possible assignment of labels to the regions of the image. The set of all possible labeling vectors can be generated automatically from the caption using natural language processing techniques. Labeling vectors provide a principled way to include diverse information from the captions, such as multiple types of words corresponding to different attributes of the same image region, labeling constraints derived from grammatical connections between words, uniqueness constraints, and spatial position indicators. They can also be used to incorporate high-level domain knowledge that improves learning performance. We show that learning is possible under this weakly supervised setup. Exploiting this property of the problem, we propose a large-margin discriminative formulation and an efficient algorithm to solve the resulting learning problem. Experiments conducted on artificial datasets and two real-world image-and-caption datasets support our claims.
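As an illustration of the bag-of-regions representation described above, the following minimal sketch enumerates candidate labeling vectors for a single image. It is not the paper's implementation: the function name `candidate_labeling_vectors`, the `null_label` placeholder for regions that match no caption word, and the rule that at least one region must carry a caption label are all assumptions made for this example. Only the uniqueness constraint (each caption label used at most once) is taken from the abstract.

```python
from itertools import product


def candidate_labeling_vectors(num_regions, caption_labels,
                               null_label="null", unique=True):
    """Enumerate label assignments for one bag of regions (hypothetical helper).

    Each returned tuple has one entry per region. A region receives either
    a label extracted from the caption or `null_label`.
    """
    choices = list(caption_labels) + [null_label]
    vectors = []
    for assignment in product(choices, repeat=num_regions):
        used = [label for label in assignment if label != null_label]
        # Assumption for this sketch: at least one region matches the caption.
        if not used:
            continue
        # Uniqueness constraint: no caption label is assigned to two regions.
        if unique and len(used) != len(set(used)):
            continue
        vectors.append(assignment)
    return vectors


# Two detected regions, caption mentioning two names.
vectors = candidate_labeling_vectors(2, ["alice", "bob"])
for v in vectors:
    print(v)
```

Exhaustive enumeration grows exponentially with the number of regions, so this is only practical for small bags. The constraints derived from the caption are what keep the candidate set manageable.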
