An automatic image-text alignment method for large-scale web image retrieval

To reduce the large uncertainty about how web images relate to their auxiliary text terms, an automatic image-text alignment algorithm is developed that supports more accurate indexing and retrieval of large-scale web images by aligning each web image with its most relevant text terms. First, large-scale web pages are crawled, and the informative images and their most relevant auxiliary text blocks are extracted. Second, parallel image clustering is performed to partition the informative web images into a large number of clusters. By grouping visually similar web images into the same cluster, our parallel image clustering algorithm significantly reduces this uncertainty and provides a good starting point for automatic image-text alignment. Finally, a relevance re-ranking algorithm is developed to identify the most relevant text terms for characterizing the semantics of the visually similar web images in the same cluster, i.e., to align the web images with their most relevant text terms. Our experiments on large-scale web images have obtained very positive results.
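
The cluster-then-re-rank idea can be illustrated with a minimal sketch. It is not the algorithm of the paper: the abstract does not specify the clustering method or the re-ranking score, so the greedy leader clustering, the cosine threshold, and the TF-IDF-style term score below are all stand-in assumptions, as are the function names and the toy data.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_images(features, threshold=0.9):
    """Greedy leader clustering (assumed stand-in for the paper's parallel
    clustering): an image joins the first cluster whose leader is similar
    enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of image indices; index 0 is the leader
    for i, f in enumerate(features):
        for c in clusters:
            if cosine(features[c[0]], f) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def rank_terms(clusters, terms_per_image):
    """Assumed re-ranking score: share of images in the cluster that carry the
    term, weighted by a smoothed inverse cluster frequency, so that terms
    spread over many clusters are pushed down."""
    cluster_terms = [
        Counter(t for i in c for t in set(terms_per_image[i])) for c in clusters
    ]
    n = len(clusters)
    df = Counter(t for ct in cluster_terms for t in ct)
    ranked = []
    for c, ct in zip(clusters, cluster_terms):
        scores = {t: (k / len(c)) * math.log((1 + n) / df[t]) for t, k in ct.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t)))
    return ranked


# Toy data (hypothetical): 2-D visual features and noisy auxiliary terms.
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
terms = [["beach", "photo"], ["beach", "sale"], ["tiger", "photo"], ["tiger", "zoo"]]

clusters = cluster_images(features)
ranked = rank_terms(clusters, terms)
for c, r in zip(clusters, ranked):
    print(c, r)
```

In this toy run, the generic term "photo" occurs in both clusters and is ranked last in each, while the terms shared by the visually similar images of a cluster rise to the top. This mirrors the intuition in the abstract that clustering narrows down which auxiliary terms are plausible for a group of images.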
