Text-based Person Search without Parallel Image-Text Data

Text-based person search (TBPS) aims to retrieve images of a target person from a large image gallery given a natural language description. Existing methods predominantly train models on parallel image-text pairs, which are very costly to collect. In this paper, we make the first attempt to explore TBPS without parallel image-text data ($\mu$-TBPS), in which only non-parallel images and texts, or even image-only data, are available. To this end, we propose a two-stage framework, generation-then-retrieval (GTR), which first generates a pseudo text for each image and then performs retrieval in a supervised manner. In the generation stage, we propose a fine-grained image captioning strategy to obtain an enriched description of each person image: it first uses a set of instruction prompts to activate an off-the-shelf pretrained vision-language model to capture and generate fine-grained person attributes, and then converts the extracted attributes into a textual description via a finetuned large language model or a hand-crafted template. In the retrieval stage, because the generated texts are noisy and can interfere with model training, we develop a confidence score-based training scheme that lets more reliable texts contribute more during training. Experimental results on multiple TBPS benchmarks (i.e., CUHK-PEDES, ICFG-PEDES and RSTPReid) show that the proposed GTR achieves promising performance without relying on parallel image-text data.
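The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute keys, the template wording, the function names, the temperature value, and the choice of an InfoNCE-style loss with normalized confidence weights are all assumptions made for illustration. The abstract states only that attributes are converted to text via a template or a finetuned language model, and that more reliable texts contribute more during training.

```python
import math


def attributes_to_text(attrs):
    """Fill a hand-crafted template from extracted person attributes.

    The keys ("gender", "upper", "lower", "shoes", "carrying") are
    hypothetical; the abstract does not list the actual attribute set.
    """
    subject = attrs.get("gender", "person")
    clothing = [attrs[k] for k in ("upper", "lower", "shoes") if k in attrs]
    sentence = f"A {subject}"
    if clothing:
        sentence += " wearing " + ", ".join(clothing)
    if "carrying" in attrs:
        sentence += f" and carrying {attrs['carrying']}"
    return sentence + "."


def image_to_text_losses(sim, temperature=0.07):
    """Per-image InfoNCE-style loss (an assumed stand-in for the matching loss).

    sim[i][j] is the similarity between image i and pseudo text j;
    the matched pair for image i sits on the diagonal.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return losses


def confidence_weighted_loss(losses, confidences):
    """Average per-sample losses, weighted by each pseudo text's confidence.

    Normalizing by the sum of confidences is one simple weighting choice,
    assumed here; texts with higher confidence contribute more.
    """
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    return sum(c * l for c, l in zip(confidences, losses)) / total
```

For example, `attributes_to_text({"gender": "woman", "upper": "a red jacket", "lower": "blue jeans"})` yields a single pseudo caption, and passing the per-image losses together with per-text confidence scores to `confidence_weighted_loss` down-weights captions judged less reliable.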
