JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features

Learning representations of social media content is the basis of many real-world applications, including information retrieval and recommendation systems. In contrast with previous works, which focus mainly on single-modal or bi-modal learning, we propose to learn social media content by jointly fusing textual, acoustic, and visual information (JTAV). Effective strategies, namely attBiGRU and DCRNN, are proposed to extract fine-grained features of each modality. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates that our proposed model outperforms state-of-the-art approaches by a large margin.
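To make the idea of attentive pooling over modalities concrete, the following is a minimal sketch rather than the JTAV implementation. It assumes each modality has already been encoded as a fixed-size vector and uses a generic additive attention scorer. The function name `attentive_pool`, the parameters `w`, `b`, and `u`, the dimensions, and the random inputs are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attentive_pool(features, w, b, u):
    """Pool a stack of modality vectors with additive attention.

    features: (n_modalities, d) array, one row per modality.
    w, b, u:  hypothetical attention parameters with shapes
              (d, k), (k,), and (k,).
    Returns the pooled (d,) vector and the (n_modalities,) weights.
    """
    hidden = np.tanh(features @ w + b)  # (n_modalities, k)
    scores = hidden @ u                 # one scalar score per modality
    weights = softmax(scores)           # non-negative, sums to 1
    pooled = weights @ features         # weighted sum of the rows
    return pooled, weights


# Random stand-ins for textual, acoustic, and visual feature vectors.
rng = np.random.default_rng(0)
d, k = 8, 4
text_vec, audio_vec, image_vec = rng.normal(size=(3, d))
features = np.stack([text_vec, audio_vec, image_vec])

# Randomly initialised parameters; in a real model these would be learned.
w = rng.normal(size=(d, k))
b = np.zeros(k)
u = rng.normal(size=k)

pooled, weights = attentive_pool(features, w, b, u)
```

Here `weights` indicates how much each modality contributes to `pooled`. In a trained model the attention parameters would be learned jointly with the modality encoders instead of being sampled at random.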
