CLIP-Driven Fine-Grained Text-Image Person Re-Identification

Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. In addition, because of the substantial gap between modalities, existing methods embed the original modal features into a shared latent space for cross-modal alignment. However, this feature embedding may distort intra-modal information. Recently, Contrastive Language-Image Pretraining (CLIP) has attracted extensive attention for its powerful semantic concept learning capacity and rich multi-modal knowledge, which can help address both problems. Accordingly, in this paper we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we perform fine-grained information excavation to mine intra-modal discriminative clues and inter-modal correspondences. Specifically, we first design a multi-grained global feature learning (MGF) module to fully mine the discriminative local information within each modality; it emphasizes identity-related discriminative clues by enhancing the interactions between the global image (text) and informative local patches (words). MGF generates a set of multi-grained global features for later inference. Second, cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules are proposed to establish cross-grained and fine-grained interactions (image-word, sentence-patch, word-patch) between modalities, which filter out unimportant and non-modality-shared image patches/words and mine cross-modal correspondences from coarse to fine. CFR and FCD are removed during inference to save computational cost. Note that the above process is performed in the original modality space without further feature embedding. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method on TIReID.
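
The abstract does not give the exact formulation of the modules, so the following is only a minimal illustrative sketch of two of the ideas it names: keeping the local tokens (patches or words) that are most informative with respect to a global token, and scoring word-patch correspondence. It rests on stated assumptions rather than on the authors' implementation: cosine similarity stands in for whatever relevance measure the framework actually uses, the word-patch score is a generic max-then-mean late-interaction rule, and the names `cosine`, `top_k_tokens`, and `word_patch_score` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_tokens(global_feat, local_feats, k):
    """Keep the k local tokens most similar to the global token.

    Assumption: cosine similarity is used here as a simple stand-in
    for a learned relevance score between global and local features.
    """
    ranked = sorted(local_feats,
                    key=lambda t: cosine(global_feat, t),
                    reverse=True)
    return ranked[:k]


def word_patch_score(words, patches):
    """Text-to-image score from word-patch similarities.

    Assumption: each word is matched to its best patch and the
    per-word maxima are averaged (a generic late-interaction rule,
    not confirmed as the rule used by the FCD module).
    """
    best = [max(cosine(w, p) for p in patches) for w in words]
    return sum(best) / len(best)


# Toy 2-D features, purely illustrative.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]
global_image = [1.0, 0.0]

selected = top_k_tokens(global_image, patches, k=2)
score = word_patch_score(words, patches)
```

In this toy case each word has a perfectly aligned patch, so the score reaches its maximum; discarding a matching patch before scoring would lower it, which is the intuition behind filtering tokens that carry no shared information across modalities.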
