Learning Semantic-Aligned Feature Representation for Text-Based Person Search

Text-based person search aims to retrieve images of a target pedestrian given a textual description. The key challenge of this task is to bridge the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which feature alignment across modalities is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network that adaptively selects and aggregates features sharing the same semantics into part-aware features; this is realized by a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
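As a rough illustration of the aggregation step described above (a sketch, not the authors' implementation), the following plain-Python code pools token features from a backbone into K part-aware features via attention with learnable part queries, and computes a diversity penalty that discourages different parts from collapsing onto the same semantics. The query vectors, dimensions, and the exact form of the penalty are assumptions for illustration only:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def part_aware_pool(tokens, queries):
    """Aggregate N token features into K part-aware features.

    tokens:  list of N d-dim feature vectors (e.g. from a Transformer backbone)
    queries: list of K d-dim part queries (hypothetical learnable parameters)

    Each part feature is an attention-weighted sum of the tokens, so tokens
    matching a part query's semantics dominate that part's feature.
    """
    d = len(tokens[0])
    parts = []
    for q in queries:
        # scaled dot-product attention scores of this query against all tokens
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        # weighted sum of token features -> one part-aware feature
        parts.append([sum(wi * t[j] for wi, t in zip(w, tokens)) for j in range(d)])
    return parts

def diversity_penalty(parts):
    """One possible diversity loss: mean pairwise cosine similarity
    between part features; minimizing it pushes parts apart."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb + 1e-8)
    k = len(parts)
    n_pairs = k * (k - 1) / 2
    return sum(cos(parts[i], parts[j]) for i in range(k)
               for j in range(i + 1, k)) / n_pairs
```

In the paper's setting the same pooling would run on both the visual and the textual token sequences, with the cross-modality part alignment loss tying the i-th visual part to the i-th textual part; that loss is omitted here.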
