Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

Text-based person search (TBPS) is a challenging task that aims to retrieve pedestrian images of the same identity from an image gallery, given a query text. In recent years, TBPS has made remarkable progress, and state-of-the-art methods achieve strong performance by learning local fine-grained correspondence between images and texts. However, most existing methods rely on explicitly generated local parts to model this fine-grained correspondence, which is unreliable because such parts lack contextual information and may introduce noise. Moreover, existing methods seldom consider the information inequality problem between modalities caused by image-specific information, that is, content such as background and environmental factors that appears in an image but is not described in the text. To address these limitations, we propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which learns image and text feature representations that are aligned across modalities at multiple levels and enables fast and effective person search. Specifically, we first design an image-specific information suppression module, which suppresses image background and environmental factors by relation-guided localization and channel attention filtration, respectively. This module effectively alleviates the information inequality problem and aligns the amount of information carried by images and texts. Second, we propose an implicit local alignment module that adaptively aggregates all pixel features of an image and all word features of a text to a set of modality-shared semantic topic centers, and implicitly learns the local fine-grained correspondence between modalities without additional supervision or cross-modal interactions. A global alignment module is also introduced to complement the local perspective. The cooperation of the global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of our MANet.
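The idea of aggregating local features to shared topic centers can be illustrated with a minimal sketch. It is not the MANet implementation: the abstract does not specify the assignment or similarity functions, so the softmax soft assignment over dot products, the weighted-mean aggregation, the per-topic cosine similarity, and all function names below (`aggregate_to_topics`, `local_similarity`) are assumptions made only for illustration. Non-empty feature lists are assumed.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_to_topics(local_feats, centers):
    """Softly assign each local feature (pixel or word) to the K shared
    centers and return K aggregated topic vectors (weighted means).
    Assumption: dot-product scores followed by a softmax over centers."""
    k, d = len(centers), len(centers[0])
    topics = [[0.0] * d for _ in range(k)]
    mass = [0.0] * k
    for x in local_feats:
        weights = softmax([dot(x, c) for c in centers])
        for j, w in enumerate(weights):
            mass[j] += w
            for t in range(d):
                topics[j][t] += w * x[t]
    # Softmax weights are strictly positive, so mass[j] > 0 for non-empty input.
    return [[v / mass[j] for v in topics[j]] for j in range(k)]


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def local_similarity(image_feats, text_feats, centers):
    """Compare the two modalities topic by topic. Each modality is
    aggregated independently against the same centers, so no
    cross-modal interaction is needed at query time."""
    img_topics = aggregate_to_topics(image_feats, centers)
    txt_topics = aggregate_to_topics(text_feats, centers)
    sims = [cosine(a, b) for a, b in zip(img_topics, txt_topics)]
    return sum(sims) / len(sims)


# Hypothetical toy data: 2 shared centers, 2-dimensional local features.
centers = [[1.0, 0.0], [0.0, 1.0]]
pixel_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
word_feats = [[1.5, 0.1], [0.2, 1.8]]
score = local_similarity(pixel_feats, word_feats, centers)
```

Because each modality is aggregated against the same centers on its own, gallery image features could in principle be computed once and reused for every query, which is consistent with the abstract's emphasis on avoiding cross-modal interactions.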
