Automatic image-text alignment for large-scale web image indexing and retrieval

In this paper, an automatic image-text alignment algorithm is developed to achieve more effective indexing and retrieval of large-scale web images by aligning web images with their most relevant auxiliary text terms or phrases. First, a large number of cross-media web pages (which contain web images and their auxiliary texts) are crawled and segmented into a set of image-text pairs (informative web images and their associated text terms or phrases). Second, near-duplicate image clustering is used to group the web images into clusters of near-duplicate images according to their visual similarity. The near-duplicate web images in the same cluster share similar semantics and are simultaneously associated with the same or a similar set of auxiliary text terms or phrases, which co-occur frequently in the relevant text blocks. Near-duplicate image clustering can therefore significantly reduce the uncertainty about how closely the semantics of web images relate to their auxiliary text terms or phrases. Finally, a random walk is performed over a phrase correlation network to achieve more precise image-text alignment by refining the relevance scores between the web images and their auxiliary text terms or phrases. Our experimental evaluation on large-scale cross-media web pages yielded very positive results.

Highlights: An image-text alignment algorithm was developed for web image indexing and retrieval. Image clustering was used to better align the semantics of web images and text. A phrase correlation network was constructed to characterize their relationship. A random walk was performed to achieve more precise image-text alignment.
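
The refinement step, a random walk over a phrase correlation network, can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the version below assumes a common random-walk-with-restart form: each phrase's relevance score is repeatedly mixed with the scores of correlated phrases while being pulled back toward its initial score. The function name `refine_relevance`, the damping parameter `alpha`, and the example phrases and weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def refine_relevance(
    initial: Dict[str, float],
    correlation: Dict[str, Dict[str, float]],
    alpha: float = 0.5,
    iters: int = 50,
    tol: float = 1e-9,
) -> Dict[str, float]:
    """Refine phrase relevance scores for one image cluster.

    initial:     phrase -> initial relevance (e.g. co-occurrence frequency).
    correlation: phrase -> {neighbour phrase -> non-negative correlation weight}.
    alpha:       weight given to scores propagated from correlated phrases
                 (assumed value; 1 - alpha is the restart weight).
    """
    phrases = sorted(initial)
    total = sum(initial.values())
    if total <= 0:
        raise ValueError("initial scores must have a positive sum")
    # Normalise the initial scores so they form a distribution.
    prior = {p: initial[p] / total for p in phrases}

    # Row-normalise correlation weights into transition probabilities.
    # A phrase with no correlated neighbours keeps its mass (self-loop).
    trans: Dict[str, Dict[str, float]] = {}
    for p in phrases:
        row = {q: w for q, w in correlation.get(p, {}).items()
               if q in prior and w > 0}
        row_sum = sum(row.values())
        trans[p] = {q: w / row_sum for q, w in row.items()} if row_sum > 0 else {p: 1.0}

    scores = dict(prior)
    for _ in range(iters):
        # Restart term: pull every phrase back toward its initial score.
        nxt = {p: (1.0 - alpha) * prior[p] for p in phrases}
        # Propagation term: each phrase passes its score to correlated phrases.
        for p in phrases:
            for q, w in trans[p].items():
                nxt[q] += alpha * scores[p] * w
        delta = max(abs(nxt[p] - scores[p]) for p in phrases)
        scores = nxt
        if delta < tol:
            break
    return scores


# Hypothetical example: two strongly correlated phrases and one weakly
# connected boilerplate phrase attached to the same image cluster.
example_initial = {"golden gate bridge": 0.4, "san francisco": 0.3, "click here": 0.3}
example_correlation = {
    "golden gate bridge": {"san francisco": 0.9},
    "san francisco": {"golden gate bridge": 0.9, "click here": 0.1},
    "click here": {"san francisco": 0.1},
}
example_scores = refine_relevance(example_initial, example_correlation)
```

In this toy setup the weakly correlated phrase ends up with a lower score than it started with, while the two mutually reinforcing phrases keep most of the mass. This is the intended effect of refining relevance scores through phrase correlations, although the paper's actual network construction and update rule may differ.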
