Multiple Bernoulli relevance models for image and video annotation

Retrieving images in response to textual queries requires some knowledge of the semantics of the picture. Here, we show how we can do both automatic image annotation and retrieval (using one word queries) from images and videos using a multiple Bernoulli relevance model. The model assumes that a training set of images or videos along with keyword annotations is provided. Multiple keywords are provided for an image and the specific correspondence between a keyword and an image is not provided. Each image is partitioned into a set of rectangular regions and a real-valued feature vector is computed over these regions. The relevance model is a joint probability distribution of the word annotations and the image feature vectors and is computed using the training set. The word probabilities are estimated using a multiple Bernoulli model and the image feature probabilities using a non-parametric kernel density estimate. The model is then used to annotate images in a test set. We show experiments on both images from a standard Corel data set and a set of video key frames from NIST's video tree. Comparative experiments show that the model performs better than a model based on estimating word probabilities using the popular multinomial distribution. The results also show that our model significantly outperforms previously reported results on the task of image and video annotation.

[1]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[2]  Tomaso A. Poggio,et al.  Example-Based Learning for View-Based Human Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Takeo Kanade,et al.  A statistical method for 3D object detection applied to faces and cars , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[4]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[5]  Dan Roth,et al.  Learning a Sparse Representation for Object Detection , 2002, ECCV.

[6]  David A. Forsyth,et al.  Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[7]  Peter Carbonetto,et al.  Why can’t José read? The problem of learning semantic associations in a robot environment , 2003, HLT-NAACL 2003.

[8]  R. Manmatha,et al.  A Model for Learning the Semantics of Pictures , 2003, NIPS.

[9]  R. Manmatha,et al.  Automatic image annotation and retrieval using cross-media relevance models , 2003, SIGIR.

[10]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[11]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[12]  David A. Forsyth,et al.  Matching Words and Pictures , 2003, J. Mach. Learn. Res..

[13]  A. P. deVries,et al.  Experimental evaluation of a generative probabilistic image retrieval model on 'easy' data , 2003 .