Automatic image annotation and retrieval using cross-media relevance models

Libraries have traditionally used manual image annotation for indexing and later retrieving their image collections. However, manual annotation is an expensive and labor-intensive procedure, and hence there has been great interest in automatic ways to retrieve images based on content. Here, we propose an automatic approach to annotating and retrieving images based on a training set of images. We assume that regions in an image can be described using a small vocabulary of blobs, which are generated from image features using clustering. Given a training set of annotated images, we show that probabilistic models allow us to predict the probability of generating a word given the blobs in an image. This may be used to automatically annotate images and to retrieve them given a word as a query. We show that relevance models allow us to derive these probabilities in a natural way. Experiments show that the annotation performance of this cross-media relevance model is almost six times as good (in terms of mean precision) as a model based on word-blob co-occurrence and twice as good as a state-of-the-art model derived from machine translation. Our approach demonstrates the usefulness of formal information retrieval models for the task of image annotation and retrieval.
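The annotation idea described above can be illustrated with a minimal sketch of a cross-media relevance model: each word's score for an unseen image is a sum over training images of the (smoothed) probability of the word in that image times the probability of the query image's blobs in that image. The function name, the toy data, and the smoothing parameters `alpha` and `beta` below are illustrative assumptions, not the paper's exact estimator or settings.

```python
from collections import Counter

def cmrm_annotate(train, blobs, alpha=0.1, beta=0.9, top_k=5):
    """Score annotation words for an unseen image described by `blobs`.

    `train` is a list of (words, blobs) pairs for annotated training images.
    Sketch only: smoothing scheme and parameters are assumptions.
    """
    # Collection-level counts, used to smooth the per-image estimates.
    word_coll, blob_coll = Counter(), Counter()
    for words, jblobs in train:
        word_coll.update(words)
        blob_coll.update(jblobs)
    W, B = sum(word_coll.values()), sum(blob_coll.values())

    scores = Counter()
    for words, jblobs in train:
        wc, bc = Counter(words), Counter(jblobs)
        nw, nb = len(words), len(jblobs)
        # P(blobs | J): product of smoothed per-blob probabilities.
        pb = 1.0
        for b in blobs:
            pb *= (1 - beta) * bc[b] / nb + beta * blob_coll[b] / B
        # Accumulate P(w, blobs) assuming a uniform prior over training images.
        for w in word_coll:
            pw = (1 - alpha) * wc[w] / nw + alpha * word_coll[w] / W
            scores[w] += pw * pb
    return [w for w, _ in scores.most_common(top_k)]
```

For example, with a tiny training set `[(["tiger", "grass"], [1, 2, 2]), (["sky", "water"], [3, 4])]`, a query image with blobs `[1, 2]` is annotated with "tiger" and "grass" rather than "sky" or "water", since the first training image dominates the blob-likelihood term.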
