Visual Query Expansion via Incremental Hypernetwork Models of Image and Text

Humans can associate the vision and language modalities and thus generate mental imagery, i.e., visual images, from linguistic input, even in an environment of unlimited inflowing information. Inspired by human memory, we separate the text-to-image retrieval task into two steps: 1) text-to-image conversion, which generates visual queries for the second step, and 2) image-to-image retrieval. This separation is advantageous for visualizing internal representations, for learning incrementally from a growing dataset, and for exploiting the results of content-based image retrieval. Here, we propose a visual query expansion method that simulates the capability of human associative memory. We use a hypernetwork (HN) model that combines visual words and linguistic words. HNs incrementally learn higher-order cross-modal associative relationships from a sequence of image-text pairs. An incremental HN generates images by assembling visual words based on linguistic cues, and we then retrieve similar images using the generated visual query. The method is evaluated on 26 video clips of 'Thomas and Friends'. Experiments show a successful image retrieval rate of up to 98.1% with a single text cue, and the method shows additional potential for generating visual queries from several text cues simultaneously.
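The cross-modal association idea above can be illustrated with a toy sketch: a hypernetwork stored as a weighted collection of hyperedges, each a small set of co-occurring visual-word and text-word tokens, trained incrementally one image-text pair at a time and queried by a text cue to recall associated visual words. This is a minimal illustration under assumed details; the class name, hyperedge order, sampling scheme, and token format are all hypothetical and not the authors' implementation.

```python
import random
from collections import Counter

class Hypernetwork:
    """Toy hypernetwork sketch (hypothetical, not the paper's implementation).

    Hyperedges are small sets of co-occurring ('v', visual_word) and
    ('t', text_word) tokens; their weights count how often each
    combination was sampled during incremental learning.
    """

    def __init__(self, order=3, samples=20, seed=0):
        self.order = order        # tokens per hyperedge (assumed order)
        self.samples = samples    # hyperedges sampled per training pair
        self.edges = Counter()    # hyperedge -> weight
        self.rng = random.Random(seed)

    def learn(self, visual_words, text_words):
        """Incrementally add hyperedges sampled from one image-text pair."""
        tokens = [("v", w) for w in visual_words] + [("t", w) for w in text_words]
        for _ in range(self.samples):
            edge = frozenset(self.rng.sample(tokens, min(self.order, len(tokens))))
            self.edges[edge] += 1

    def recall_visual(self, text_cue, top_k=5):
        """Assemble a visual query: rank visual words by the total weight
        of hyperedges that also contain the text cue."""
        scores = Counter()
        for edge, weight in self.edges.items():
            if ("t", text_cue) in edge:
                for kind, word in edge:
                    if kind == "v":
                        scores[word] += weight
        return [word for word, _ in scores.most_common(top_k)]
```

In use, each incoming pair of extracted visual words and caption words is fed to `learn`, and a text cue later recalls the visual words to assemble into a visual query image; the recalled words would then drive an image-to-image retrieval step, which this sketch does not cover.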
