Topic regression multi-modal Latent Dirichlet Allocation for image annotation

We present topic-regression multi-modal Latent Dirich-let Allocation (tr-mmLDA), a novel statistical topic model for the task of image and video annotation. At the heart of our new annotation model lies a novel latent variable regression approach to capture correlations between image or video features and annotation texts. Instead of sharing a set of latent topics between the 2 data modalities as in the formulation of correspondence LDA in [2], our approach introduces a regression module to correlate the 2 sets of topics, which captures more general forms of association and allows the number of topics in the 2 data modalities to be different. We demonstrate the power of tr-mmLDA on 2 standard annotation datasets: a 5000-image subset of COREL and a 2687-image LabelMe dataset. The proposed association model shows improved performance over correspondence LDA as measured by caption perplexity.

[1]  R. Manmatha,et al.  Multiple Bernoulli relevance models for image and video annotation , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[2]  John D. Lafferty,et al.  A correlated topic model of Science , 2007, 0708.3601.

[3]  Luc Van Gool,et al.  Modeling scenes with local descriptors and latent aspects , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[4]  R. Manmatha,et al.  A Model for Learning the Semantics of Pictures , 2003, NIPS.

[5]  James Ze Wang,et al.  Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Hagai Attias,et al.  Independent factor topic models , 2009, ICML '09.

[7]  Chong Wang,et al.  Simultaneous image classification and annotation , 2009, CVPR.

[8]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[9]  Gustavo Carneiro,et al.  Formulating semantic image annotation as a supervised learning problem , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[10]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[11]  Cordelia Schmid,et al.  Coloring Local Feature Extraction , 2006, ECCV.

[12]  David A. Forsyth,et al.  Matching Words and Pictures , 2003, J. Mach. Learn. Res..

[13]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[14]  David M. Blei,et al.  Supervised Topic Models , 2007, NIPS.