Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning

One of the main challenges in unsupervised machine learning is the cost of processing and extracting useful information from large datasets. In this work, we propose a classifier ensemble that leverages the transfer learning capabilities of the CLIP neural network architecture in multimodal (image and text) environments drawn from social media. For this purpose, we use the InstaNY100K dataset and propose a validation approach based on sampling techniques. Our experiments address image classification according to the labels of the Places dataset, first considering only the visual component and then adding the associated texts as support. The results show that pretrained neural networks such as CLIP can be applied successfully to image classification with little fine-tuning, and that taking the texts associated with the images into account can further improve accuracy depending on the goal. Overall, the results point to a promising research direction.
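To make the setup concrete, below is a minimal sketch of zero-shot scene classification with CLIP, fusing an image branch with a caption branch. It assumes the HuggingFace "openai/clip-vit-base-patch32" checkpoint; the label subset, file path, example caption, and 0.7/0.3 fusion weights are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical subset of Places scene labels (the paper uses the full set).
labels = ["beach", "restaurant", "park", "bridge", "stadium"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.open("post_image.jpg")          # placeholder image path
caption = "sunset drinks by the water #nyc"   # placeholder post caption

with torch.no_grad():
    # Visual branch: compare the image against the label prompts.
    img_inputs = processor(text=prompts, images=image,
                           return_tensors="pt", padding=True)
    img_probs = model(**img_inputs).logits_per_image.softmax(dim=-1)

    # Textual branch: compare the caption embedding against the prompt embeddings.
    cap_emb = model.get_text_features(
        **processor(text=[caption], return_tensors="pt", padding=True))
    lbl_emb = model.get_text_features(
        **processor(text=prompts, return_tensors="pt", padding=True))
    cap_emb = cap_emb / cap_emb.norm(dim=-1, keepdim=True)
    lbl_emb = lbl_emb / lbl_emb.norm(dim=-1, keepdim=True)
    txt_probs = (100.0 * cap_emb @ lbl_emb.T).softmax(dim=-1)

# Weighted fusion of the two branches; the weights here are assumed for illustration.
probs = 0.7 * img_probs + 0.3 * txt_probs
print(labels[probs.argmax().item()])
```

In this sketch the caption acts as the supporting signal: when the photo alone is ambiguous, agreement between the caption embedding and a label prompt can shift the fused prediction toward the correct scene.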
