A Model for Learning the Semantics of Pictures

We propose an approach to learning the semantics of images that allows us to annotate an image automatically with keywords and to retrieve images using text queries. We do this with a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of annotated images, we compute a joint probabilistic model of image features and words that allows us to predict the probability of generating a word given the regions of an image. The model can then be used to annotate new images automatically and to retrieve images given a word as a query. Experiments show that our model significantly outperforms the best previously reported results on the tasks of automatic image annotation and retrieval.
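To make the generative formulation concrete, the sketch below implements one plausible instance of such a joint model, not the paper's exact estimators: a uniform prior over training images, a Gaussian kernel density over continuous region features, and a corpus-smoothed multinomial over annotation words. All names and parameters here (annotate, sigma, beta) are illustrative assumptions.

```python
import numpy as np

def annotate(image_regions, train_images, vocab, sigma=1.0, beta=0.1):
    """Rank vocabulary words for a new image under a joint model of the form
    P(w, r_1..r_n) = sum_J P(J) * P(w|J) * prod_i P(r_i|J).

    image_regions : (n, d) array of region feature vectors for the query image.
    train_images  : list of (regions, words) pairs; regions is an (m, d) array,
                    words is a set of annotation keywords.
    vocab         : list of all annotation keywords.
    sigma, beta   : illustrative kernel bandwidth and smoothing weight.
    """
    d = image_regions.shape[1]
    # Corpus-level word frequencies, used to smooth per-image word estimates.
    counts = {w: sum(w in ws for _, ws in train_images) for w in vocab}
    total = sum(counts.values())

    scores = np.zeros(len(vocab))
    for regions_J, words_J in train_images:  # uniform prior P(J)
        # P(r_i | J): Gaussian kernel density over J's region feature vectors.
        diffs = image_regions[:, None, :] - regions_J[None, :, :]  # (n, m, d)
        k = np.exp(-np.sum(diffs ** 2, axis=2) / (2 * sigma ** 2))
        k /= (2 * np.pi * sigma ** 2) ** (d / 2)
        log_p_regions = np.sum(np.log(k.mean(axis=1) + 1e-300))

        # P(w | J): word presence in J's annotation, smoothed with corpus stats.
        for j, w in enumerate(vocab):
            p_w = ((1 - beta) * (w in words_J) / max(len(words_J), 1)
                   + beta * counts[w] / total)
            scores[j] += p_w * np.exp(log_p_regions)

    order = np.argsort(-scores)
    return [vocab[j] for j in order]  # words ranked by joint score
```

The top-ranked words serve as the automatic annotation; retrieval for a query word follows directly by ranking images in a collection by the same joint score for that word.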
