Paper information - Towards Generating and Evaluating Iconographic Image Captions of Artworks

Towards Generating and Evaluating Iconographic Image Captions of Artworks

Automatically generating accurate and meaningful textual descriptions of images is an ongoing research challenge. Recently, considerable progress has been made by adopting multimodal deep learning approaches that integrate vision and language. However, image captioning models are most commonly developed using datasets of natural images, and few contributions have been made in the domain of artwork images. One of the main reasons is the lack of large-scale art datasets with adequate image-text pairs. Another reason is that generating accurate descriptions of artwork images is particularly challenging, because descriptions of artworks are more complex and can include multiple levels of interpretation. It is therefore also especially difficult to evaluate generated captions of artwork images effectively. The aim of this work is to address some of those challenges by utilizing a large-scale dataset of artwork images annotated with concepts from the Iconclass classification system. Using this dataset, a captioning model is developed by fine-tuning a transformer-based vision-language pretrained model. Due to the complex relations between image and text pairs in the domain of artwork images, the generated captions are evaluated using several quantitative and qualitative approaches. The performance is assessed using standard image captioning metrics and a recently introduced reference-free metric. The quality of the generated captions and the model’s capacity to generalize to new data are explored by applying the model to another art dataset and examining the relation between commonly generated captions and the genre of the artworks. The overall results suggest that the model can generate meaningful captions that show stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.
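The abstract does not spell out how the reference-free metric is computed. As a rough illustration only, the sketch below assumes a CLIPScore-style formulation: the cosine similarity between an image embedding and a caption embedding, clipped at zero and rescaled by 2.5 as in the CLIPScore paper. The function names and the toy vectors are hypothetical; in practice the embeddings would come from a pretrained image-text encoder, which is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reference_free_score(image_embedding, caption_embedding, weight=2.5):
    """CLIPScore-style score: weight * max(cos(image, caption), 0).

    Assumption: both embeddings live in the same joint space, as produced
    by an image-text encoder. No reference captions are needed.
    """
    similarity = cosine_similarity(image_embedding, caption_embedding)
    return weight * max(similarity, 0.0)


# Toy vectors standing in for real encoder outputs (hypothetical values).
image_vec = [0.2, 0.9, 0.1]
good_caption_vec = [0.25, 0.85, 0.05]
poor_caption_vec = [0.9, -0.3, 0.1]

good = reference_free_score(image_vec, good_caption_vec)
poor = reference_free_score(image_vec, poor_caption_vec)
print(f"closer caption: {good:.3f}, unrelated caption: {poor:.3f}")
```

Because the score depends only on the image and the candidate caption, it can be computed for artworks that have no ground-truth description, which is the situation the abstract describes as hard to evaluate with reference-based metrics.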

Eva Cetinic | E. Cetinic
