Multi-modal Label Retrieval for the Visual Arts: The Case of Iconclass

Iconclass is an iconographic classification system from the cultural heritage domain that is used to annotate the subjects represented in works of visual art. In this work, we investigate the feasibility of automatically assigning Iconclass codes to visual artworks using a cross-modal retrieval set-up. We explore the text and image branches of the cross-modal network separately. In addition, we describe a multi-modal architecture that can jointly capitalize on multiple feature sources: textual features, derived from the titles of the artworks (in multiple languages), and visual features, extracted from photographic reproductions of the artworks. We use the English Iconclass definitions as the labels to be matched. We evaluate our approach on a publicly available dataset of artworks with English and Dutch titles. Our results demonstrate that, in isolation, textual features strongly outperform visual features, although visual features can still offer a useful complement to purely linguistic features. Moreover, we show that the cross-lingual (Dutch-English) strategy is on par with the monolingual (English-English) approach, which opens important perspectives for applying this approach beyond resource-rich languages.
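The retrieval set-up described above can be illustrated with a minimal sketch. It is not the architecture evaluated in this work: it assumes that title, image, and label-definition embeddings have already been projected into one shared space (in practice a trained network would learn that projection), and it fuses the two modalities with a simple weighted sum. The function names, the `text_weight` parameter, the toy vectors, and the placeholder label names are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_vec, image_vec, text_weight=0.5):
    """Weighted sum of unit-length text and image embeddings.

    Assumption: both embeddings already live in the same space.
    """
    t = l2_normalize(text_vec)
    v = l2_normalize(image_vec)
    mixed = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, v)]
    return l2_normalize(mixed)


def rank_labels(query_vec, label_vecs):
    """Return label names sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query_vec)
    scores = {
        name: sum(a * b for a, b in zip(q, l2_normalize(vec)))
        for name, vec in label_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings; the label names are placeholders, not real Iconclass codes.
label_vecs = {
    "label_a": [1.0, 0.0, 0.0],
    "label_b": [0.0, 1.0, 0.0],
    "label_c": [0.0, 0.0, 1.0],
}
title_vec = [0.9, 0.1, 0.0]
image_vec = [0.2, 0.7, 0.1]

# Text-dominated fusion versus an image-only query.
ranking_fused = rank_labels(fuse(title_vec, image_vec, text_weight=0.7), label_vecs)
ranking_image = rank_labels(fuse(title_vec, image_vec, text_weight=0.0), label_vecs)
```

Varying `text_weight` in this sketch shows how the ranking can shift as more or less weight is placed on the title relative to the image, which is the kind of trade-off a multi-modal model has to learn.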
