VSE-ens: Visual-Semantic Embeddings with Efficient Negative Sampling

Joint visual-semantic embeddings (VSE) have become a research hotspot for the task of image annotation, which suffers from the semantic gap, i.e., the gap between the low-level visual features of images and the high-level semantic features of labels. The issue becomes even more challenging when visual features cannot be extracted from images, that is, when images are denoted only by numerical IDs, as in some real datasets. Existing VSE methods typically sample negative examples uniformly, looking for ones that violate the ranking order against positive examples, which requires a time-consuming search over the whole label space. In this paper, we propose a fast adaptive negative sampler that works well in settings where no image pixels are available. Our sampling strategy chooses the negative examples that are most likely to violate the ranking order, according to the latent factors of images. In this way, our approach scales linearly to large datasets. Experiments demonstrate that our approach converges 5.02x faster than state-of-the-art approaches on OpenImages, 2.5x faster on IAPR-TC12, and 2.06x faster on NUS-WIDE, and achieves better ranking accuracy across datasets.
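
To make the contrast between the two sampling strategies concrete, the following is a minimal Python sketch and not the implementation from the paper. It assumes each image and each label is represented by a latent-factor vector, and it borrows the general shape of adaptive samplers from the pairwise-ranking literature: pick a latent factor in proportion to its magnitude for the image, rank labels on that factor, and favor top-ranked labels. The function names, the `margin` and `stop_prob` parameters, and the geometric rank draw are all illustrative assumptions.

```python
import random


def uniform_violating_negative(scores, positive, margin=1.0, rng=random,
                               max_trials=None):
    """Baseline: draw labels uniformly until one violates the ranking margin.

    scores[j] is the model score of label j for the current image.
    Returns (label, trials), or (None, trials) if no violator was found.
    """
    n = len(scores)
    max_trials = max_trials or n
    for trial in range(1, max_trials + 1):
        j = rng.randrange(n)
        # A negative "violates" when it scores within `margin` of the positive.
        if j != positive and scores[j] + margin > scores[positive]:
            return j, trial
    return None, max_trials


def adaptive_negative(image_vec, label_vecs, positive, stop_prob=0.5,
                      rng=random):
    """Hypothetical adaptive sampler driven by the image's latent factors.

    1. Pick a latent factor f with probability proportional to |image_vec[f]|.
    2. Rank candidate labels by their contribution on factor f.
    3. Draw a rank from a truncated geometric distribution, so labels most
       likely to violate the ranking order are chosen most often.
    """
    weights = [abs(x) for x in image_vec]
    if sum(weights) == 0:
        weights = [1.0] * len(image_vec)  # fall back to a uniform factor choice
    f = rng.choices(range(len(image_vec)), weights=weights, k=1)[0]

    # A negative image factor flips which labels contribute most to the score.
    sign = 1.0 if image_vec[f] >= 0 else -1.0
    candidates = [j for j in range(len(label_vecs)) if j != positive]
    # Sorting here keeps the sketch short; a real system would cache rankings.
    candidates.sort(key=lambda j: sign * label_vecs[j][f], reverse=True)

    rank = 0
    while rank < len(candidates) - 1 and rng.random() > stop_prob:
        rank += 1
    return candidates[rank]


if __name__ == "__main__":
    rng = random.Random(0)
    image_vec = [2.0, 0.0]
    label_vecs = [[0.9, 0.0], [0.1, 0.0], [0.8, 0.0], [-0.5, 0.0]]
    print(adaptive_negative(image_vec, label_vecs, positive=0, rng=rng))
```

The uniform baseline may need many draws when few labels violate the margin, whereas the adaptive draw always returns a candidate in a single pass over a ranking. Re-sorting on every call, as this sketch does, would forfeit that advantage in practice, so per-factor rankings would normally be precomputed and refreshed periodically.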
