AAformer: Auto-Aligned Transformer for Person Re-Identification

Transformer is showing its superiority over convolutional architectures in many vision tasks like image classification and object detection. However, the lacking of an explicit alignment mechanism limits its capability in person re-identification (re-ID), in which there are inevitable misalignment issues caused by pose/viewpoints variations, etc. On the other hand, the alignment paradigm of convolutional neural networks does not perform well in Transformer in our experiments. To address this problem, we develop a novel alignment framework for Transformer through adding the learnable vectors of “part tokens” to learn the part representations and integrating the part alignment into the selfattention. A part token only interacts with a subset of patch embeddings and learns to represent this subset. Based on the framework, we design an online Auto-Aligned Transformer (AAformer) to adaptively assign the patch embeddings of the same semantics to the identical part token in the running time. The part tokens can be regarded as the part prototypes, and a fast variant of Sinkhorn-Knopp algorithm is employed to cluster the patch embeddings to part tokens online. AAformer can be viewed as a new principled formulation for simultaneously learning both part alignment and part representations. Extensive experiments validate the effectiveness of part tokens and the superiority of AAformer over various state-of-the-art CNN-based methods. Our codes will be released.

[1]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[2]  Longhui Wei,et al.  Person Transfer GAN to Bridge Domain Gap for Person Re-identification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Yi Yang,et al.  Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in Vitro , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Yi Yang,et al.  Random Erasing Data Augmentation , 2017, AAAI.

[5]  Pichao Wang,et al.  TransReID: Transformer-based Object Re-Identification , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7]  Jianyuan Guo,et al.  Beyond Human Parts: Dual Part-Aligned Representations for Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Kaiqi Huang,et al.  Towards Rich Feature Discovery With Class Activation Maps Augmentation for Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Ziyan Wu,et al.  Re-Identification With Consistent Attentive Siamese Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Shaogang Gong,et al.  Harmonious Attention Network for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Cheng Wang,et al.  Mancs: A Multi-task Attentional Network with Curriculum Sampling for Person Re-Identification , 2018, ECCV.

[12]  Haiqing Li,et al.  Deep Spatial Feature Reconstruction for Partial Person Re-identification: Alignment-free Approach , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Xiaogang Wang,et al.  FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification , 2018, NeurIPS.

[14]  Zhenan Sun,et al.  Recognizing Partial Biometric Patterns , 2018, ArXiv.

[15]  Tao Mei,et al.  Part-Aligned Bilinear Representations for Person Re-identification , 2018, ECCV.

[16]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[17]  Qi Tian,et al.  Beyond Part Models: Person Retrieval with Refined Part Pooling , 2017, ECCV.

[18]  Zheng Zhang,et al.  Disentangled Non-Local Neural Networks , 2020, ECCV.

[19]  Kurt Keutzer,et al.  Visual Transformers: Token-based Image Representation and Processing for Computer Vision , 2020, ArXiv.

[20]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Bingbing Ni,et al.  Pose Transferrable Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Liang Zheng,et al.  Re-ranking Person Re-identification with k-Reciprocal Encoding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[24]  Pichao Wang,et al.  Trear: Transformer-Based RGB-D Egocentric Action Recognition , 2021, IEEE Transactions on Cognitive and Developmental Systems.

[25]  Andrea Vedaldi,et al.  Self-labelling via simultaneous clustering and representation learning , 2020, ICLR.

[26]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Wu Liu,et al.  Guided Saliency Feature Learning for Person Re-identification in Crowded Scenes , 2020, ECCV.

[28]  Lars Petersson,et al.  Bilinear Attention Networks for Person Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[31]  Yongdong Zhang,et al.  Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[33]  Shiyu Chang,et al.  TransGAN: Two Transformers Can Make One Strong GAN , 2021, ArXiv.

[34]  Qi Tian,et al.  Scalable Person Re-identification: A Benchmark , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Kim-Hui Yap,et al.  AANet: Attribute Attention Network for Person Re-Identifications , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yu Wu,et al.  Pose-Guided Feature Alignment for Occluded Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[38]  Zhedong Zheng,et al.  Joint Discriminative and Generative Learning for Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Kevin Gimpel,et al.  Gaussian Error Linear Units (GELUs) , 2016 .

[40]  Julien Mairal,et al.  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments , 2020, NeurIPS.

[41]  Shiguang Shan,et al.  Interaction-And-Aggregation Network for Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Wei Jiang,et al.  Bag of Tricks and a Strong Baseline for Deep Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[43]  Jinqing Qi,et al.  HAT: Hierarchical Aggregation Transformers for Person Re-identification , 2021, ACM Multimedia.

[44]  Gang Yu,et al.  High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Andrea Cavallaro,et al.  Omni-Scale Feature Learning for Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Ming Tang,et al.  Identity-Guided Human Semantic Parsing for Person Re-Identification , 2020, ECCV.

[47]  Wenjun Zeng,et al.  Densely Semantically Aligned Person Re-Identification , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[49]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[51]  Yan Wang,et al.  Resource Aware Person Re-identification Across Multiple Resolutions , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Huchuan Lu,et al.  Pose-Invariant Embedding for Deep Person Re-Identification , 2017, IEEE Transactions on Image Processing.

[53]  Xiaogang Wang,et al.  DeepReID: Deep Filter Pairing Neural Network for Person Re-identification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Muhittin Gokmen,et al.  Human Semantic Parsing for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[55]  Xiong Chen,et al.  Learning Discriminative Features with Multiple Granularities for Person Re-Identification , 2018, ACM Multimedia.

[56]  Kaiqi Huang,et al.  Adversarially Occluded Samples for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Jian Sun,et al.  AlignedReID: Surpassing Human-Level Performance in Person Re-Identification , 2017, ArXiv.