VELDA: Relating an Image Tweet's Text and Images

Image tweets are becoming a prevalent form of social media, but little is known about their content (textual and visual) and the relationship between the two media. Our analysis of image tweets shows that while visual elements certainly play a large role in the image-text relationship, other factors, such as emotional elements, also shape it. We develop Visual-Emotional LDA (VELDA), a novel topic model that captures the image-text correlation from multiple perspectives (namely, visual and emotional). Experiments on real-world image tweets in both English and Chinese, as well as on other user-generated content, show that VELDA significantly outperforms existing methods on cross-modality image retrieval. Even in domains where emotion does not directly influence image choice, VELDA demonstrates good generalization ability, achieving higher-fidelity modeling of such multimedia documents.
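
To make the multi-perspective idea concrete, below is a minimal, hypothetical sketch of a multi-modal LDA-style generative process in the spirit of VELDA, written in Python with NumPy. It assumes a single per-document topic mixture shared across three modality-specific vocabularies (textual words, visual words, and emotional words); the abstract does not specify VELDA's exact factorization, so all names, vocabulary sizes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative sizes; the paper's actual settings are not given in this abstract.
K = 5                                   # number of topics
V_text, V_vis, V_emo = 1000, 500, 200   # per-modality vocabulary sizes
alpha, beta = 0.1, 0.01                 # symmetric Dirichlet hyperparameters

# One topic-word distribution per topic, for each modality.
phi_text = rng.dirichlet(np.full(V_text, beta), size=K)
phi_vis = rng.dirichlet(np.full(V_vis, beta), size=K)
phi_emo = rng.dirichlet(np.full(V_emo, beta), size=K)

def generate_tweet(n_text=20, n_vis=30, n_emo=5):
    """Sample one synthetic image tweet as (text, visual, emotional) token lists."""
    theta = rng.dirichlet(np.full(K, alpha))  # document-level topic mixture
    def sample_tokens(n, phi):
        zs = rng.choice(K, size=n, p=theta)   # topic assignment per token
        return [int(rng.choice(phi.shape[1], p=phi[z])) for z in zs]
    return (sample_tokens(n_text, phi_text),
            sample_tokens(n_vis, phi_vis),
            sample_tokens(n_emo, phi_emo))

text_tokens, visual_tokens, emotion_tokens = generate_tweet()
print(len(text_tokens), len(visual_tokens), len(emotion_tokens))

Under this simplified coupling, the shared mixture theta is what ties a tweet's text to its image: retrieving images for a text query amounts to inferring theta from the text tokens and ranking images by the likelihood of their visual and emotional tokens under that mixture.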
