Learning Semantic-Aligned Feature Representation for Text-Based Person Search

Text-based person search aims to retrieve images of a target pedestrian given a textual description. The key challenge of this task is to bridge the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which feature alignment across modalities is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network that adaptively selects and aggregates features sharing the same semantics into part-aware features; this is realized by a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
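As a rough illustration of the aggregation step described above (a sketch, not the authors' implementation), the following plain-Python code pools token features from a backbone into K part-aware features via attention with learnable part queries, and computes a diversity penalty that discourages different parts from collapsing onto the same semantics. The query vectors, dimensions, and the exact form of the penalty are assumptions for illustration only:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def part_aware_pool(tokens, queries):
    """Aggregate N token features into K part-aware features.

    tokens:  list of N d-dim feature vectors (e.g. from a Transformer backbone)
    queries: list of K d-dim part queries (hypothetical learnable parameters)

    Each part feature is an attention-weighted sum of the tokens, so tokens
    matching a part query's semantics dominate that part's feature.
    """
    d = len(tokens[0])
    parts = []
    for q in queries:
        # scaled dot-product attention scores of this query against all tokens
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        # weighted sum of token features -> one part-aware feature
        parts.append([sum(wi * t[j] for wi, t in zip(w, tokens)) for j in range(d)])
    return parts

def diversity_penalty(parts):
    """One possible diversity loss: mean pairwise cosine similarity
    between part features; minimizing it pushes parts apart."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb + 1e-8)
    k = len(parts)
    n_pairs = k * (k - 1) / 2
    return sum(cos(parts[i], parts[j]) for i in range(k)
               for j in range(i + 1, k)) / n_pairs
```

In the paper's setting the same pooling would run on both the visual and the textual token sequences, with the cross-modality part alignment loss tying the i-th visual part to the i-th textual part; that loss is omitted here.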
