Separating Self-Expression and Visual Content in Hashtag Supervision

The variety, abundance, and structured nature of hashtags make them an interesting data source for training vision models. For instance, hashtags have the potential to significantly reduce the problem of manual supervision and annotation when learning vision models for a large number of concepts. However, a key challenge when learning from hashtags is that they are inherently subjective because they are provided by users as a form of self-expression. As a consequence, hashtags may have synonyms (different hashtags referring to the same visual content) and may be polysemous (the same hashtag referring to different visual content). These challenges limit the effectiveness of approaches that simply treat hashtags as image-label pairs. This paper presents an approach that extends upon modeling simple image-label pairs with a joint model of images, hashtags, and users. We demonstrate the efficacy of such approaches in image tagging and retrieval experiments, and show how the joint model can be used to perform user-conditional retrieval and tagging.

[1]  L. Tucker,et al.  Some mathematical notes on three-mode factor analysis , 1966, Psychometrika.

[2]  R. Harshman,et al.  PARAFAC: parallel factor analysis , 1994 .

[3]  Joshua B. Tenenbaum,et al.  Separating Style and Content , 1996, NIPS.

[4]  Demetri Terzopoulos,et al.  Multilinear Analysis of Image Ensembles: TensorFaces , 2002, ECCV.

[5]  M. Alex O. Vasilescu Human motion signatures: analysis, synthesis, recognition , 2002, Object recognition supported by user interaction for service robots.

[6]  Tamir Hazan,et al.  Non-negative tensor factorization with applications to statistics and computer vision , 2005, ICML.

[7]  Mor Naaman,et al.  Towards automatic extraction of event and place semantics from flickr tags , 2007, SIGIR.

[8]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[9]  Jason Weston,et al.  Large scale image annotation: learning to rank with joint word-image embeddings , 2010, Machine Learning.

[10]  Pietro Perona,et al.  The Multidimensional Wisdom of Crowds , 2010, NIPS.

[11]  Steffen Rendle,et al.  Factorization Machines , 2010, 2010 IEEE International Conference on Data Mining.

[12]  Tamara G. Kolda,et al.  Temporal Link Prediction Using Matrix and Tensor Factorizations , 2010, TKDD.

[13]  Hans-Peter Kriegel,et al.  A Three-Way Model for Collective Learning on Multi-Relational Data , 2011, ICML.

[14]  Fabio Cuzzolin,et al.  Fisher Tensor Decomposition for Unconstrained Gait Recognition , 2013 .

[15]  Wesley De Neve,et al.  Using topic models for Twitter hashtag recommendation , 2013, WWW.

[16]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[17]  Anima Anandkumar,et al.  Tensor decompositions for learning latent variable models , 2012, J. Mach. Learn. Res..

[18]  Omer Levy,et al.  Neural Word Embedding as Implicit Matrix Factorization , 2014, NIPS.

[19]  Daniel Jurafsky,et al.  Do Multi-Sense Embeddings Improve Natural Language Understanding? , 2015, EMNLP.

[20]  Emily Denton,et al.  User Conditional Hashtag Prediction for Images , 2015, KDD.

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  David A. Shamma,et al.  YFCC100M , 2015, Commun. ACM.

[23]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  François Chollet Information-theoretical label embeddings for large-scale image classification , 2016, ArXiv.

[25]  Allan Jabri,et al.  Learning Visual Features from Large Weakly Supervised Data , 2015, ECCV.

[26]  Michael D Byrne,et al.  Comparing vector-based and Bayesian memory models using large-scale datasets: User-generated hashtag and tag prediction on Twitter and Stack Overflow. , 2016, Psychological methods.

[27]  Frédo Durand,et al.  Understanding Infographics through Textual and Visual Tag Prediction , 2017, ArXiv.

[28]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Allan Jabri,et al.  Learning Visual N-Grams from Web Data , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Abhinav Gupta,et al.  Learning from Noisy Large-Scale Datasets with Minimal Supervision , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Pietro Perona,et al.  The Devil is in the Tails: Fine-grained Classification in the Wild , 2017, ArXiv.