Multimodal Image Retrieval

In this work we extend the standard singlelayer probabilistic Latent Semantic Analysis (pLSA) [16] to multiple layers. As multiple layers should naturally handle multiple modalities and a hierarchy of abstractions, we denote this new approach multilayer multimodal probabilistic Latent Semantic Analysis (mmpLSA). We derive the training and inference rules for the smallest possible non-degenerated mm-pLSA model: a model with two leaf-pLSAs and a single top-level pLSA node merging the two leaf-pLSAs. We evaluate this approach on two pairs of different modalities: SIFT features and image annotations (tags) as well as the combination of SIFT and HOG features. We also propose a fast and strictly stepwise forward procedure to initialize bottom-up the mm-pLSA model, which in turn can then be post-optimized by the general mm-pLSA learning algorithm. The proposed approach is evaluated in a query-by-example retrieval task where various variants of our mm-pLSA system are compared to systems relying on a single modality and other adhoc combinations of feature histograms. We further describe possible pitfalls of the mm-pLSA training and analyze the resulting model yielding an intuitive explanation of its behaviour.

[1]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[2]  Rainer Lienhart,et al.  Deep networks for image retrieval on large-scale databases , 2008, ACM Multimedia.

[3]  Eli Shechtman,et al.  Matching Local Self-Similarities across Images and Videos , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  J. Hawkins,et al.  On Intelligence , 2004 .

[5]  Alexei A. Efros,et al.  Unsupervised discovery of visual object class hierarchies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Daniel Gatica-Perez,et al.  PLSA-based image auto-annotation: constraining the latent space , 2004, MULTIMEDIA '04.

[7]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[8]  John D. Lafferty,et al.  Correlated Topic Models , 2005, NIPS.

[9]  Andrew Zisserman,et al.  Scene Classification Via pLSA , 2006, ECCV.

[10]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[11]  Rainer Lienhart,et al.  Multilayer pLSA for multimodal image retrieval , 2009, CIVR '09.

[12]  Rainer Lienhart,et al.  Multimodal pLSA on visual features and tags , 2009, 2009 IEEE International Conference on Multimedia and Expo.

[13]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[14]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[15]  David A. Forsyth,et al.  Matching Words and Pictures , 2003, J. Mach. Learn. Res..

[16]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[17]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[18]  R. Lienhart,et al.  Continuous visual vocabulary modelsfor pLSA-based scene recognition , 2008, CIVR '08.

[19]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[20]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[21]  R. Lienhart,et al.  Correlated Topic Models for Image Retrieval , 2008 .

[22]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[24]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[25]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Jitendra Malik,et al.  Shape matching and object recognition using low distortion correspondences , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[27]  Ling Guan,et al.  Multi-feature pLSA for combining visual features in image annotation , 2011, ACM Multimedia.

[28]  Mor Naaman,et al.  How flickr helps us make sense of the world: context and content in community-contributed media collections , 2007, ACM Multimedia.

[29]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[30]  Rainer Lienhart,et al.  Image retrieval on large-scale image databases , 2007, CIVR '07.

[31]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Rainer Lienhart,et al.  PLSA on Large Scale Image Databases , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.