Translating topics to words for image annotation

One of the classic techniques for image annotation is the language translation model. It views an image as a document, i.e., a set of visual words obtained by vector quantizing the image regions produced by unsupervised image segmentation. Annotation is then achieved by translating visual words into textual words, much like translating an English document into French. In this paper, we also view an image as a document, but we treat annotation as two consecutive processes: document summarization and translation. In the summarization process, an image document is first summarized into its own visual language, which we call visual topics. The translation process then translates these visual topics into textual words. Compared with the original translation model, our visual topics, learned by probabilistic latent semantic analysis (PLSA), provide an intermediate level of abstraction in the visual description. We show improved annotation performance on the Corel image dataset.
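The pipeline above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes region features have already been vector quantized into a documents-by-visual-words count matrix, fits PLSA by EM to obtain the visual topics P(z|d) and P(w|z), and then estimates a simple topic-to-text translation table P(t|z) from training annotations by co-occurrence. All function names and the toy data layout are our own assumptions.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM on an (images x visual-words) count matrix n(d, w).
    Returns the topic mixtures P(z|d) and topic-word distributions P(w|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (d, z, w)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight by the observed counts n(d, w)
        weighted = counts[:, None, :] * joint              # n(d,w) P(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def topic_to_text(p_z_d, annotations, n_text_words):
    """Estimate a translation table P(t|z) by accumulating each training
    image's topic mixture over its annotation words (co-occurrence)."""
    table = np.zeros((p_z_d.shape[1], n_text_words))
    for d, words in enumerate(annotations):
        for t in words:
            table[:, t] += p_z_d[d]
    table /= table.sum(axis=1, keepdims=True) + 1e-12
    return table

def annotate(p_z_d, p_t_z, top_k=3):
    """Rank textual words for each image: P(t|d) = sum_z P(z|d) P(t|z).
    (For unseen images, P(z|d) would first be folded in with P(w|z) fixed.)"""
    p_t_d = p_z_d @ p_t_z
    return np.argsort(-p_t_d, axis=1)[:, :top_k]
```

A usage sketch: `p_z_d, p_w_z = plsa(counts, n_topics=10)` on the quantized training images, `p_t_z = topic_to_text(p_z_d, train_annotations, vocab_size)`, then `annotate(...)` returns the top-ranked annotation word indices per image.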
