论文信息 - Set Augmented Triplet Loss for Video Person Re-Identification

Set Augmented Triplet Loss for Video Person Re-Identification

Modern video person re-identification (re-ID) machines are often trained using a metric learning approach, supervised by a triplet loss. The triplet loss used in video re-ID is usually based on so-called clip features, each aggregated from a few frame features. In this paper, we propose to model the video clip as a set and instead study the distance between sets in the corresponding triplet loss. In contrast to the distance between clip representations, the distance between clip sets considers the pair-wise similarity of each element (i.e., frame representation) between two sets. This allows the network to directly optimize the feature representation at a frame level. Apart from the commonly-used set distance metrics (e.g., ordinary distance and Hausdorff distance), we further propose a hybrid distance metric, tailored for the set-aware triplet loss. Also, we propose a hard positive set construction strategy using the learned class prototypes in a batch. Our proposed method achieves state-of-the-art results across several standard benchmarks, demonstrating the advantages of the proposed method.

[1] Yi Yang,et al. Person Re-identification: Past, Present and Future , 2016, ArXiv.

[2] Yu Wu,et al. Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3] Qi Tian,et al. MARS: A Video Benchmark for Large-Scale Person Re-Identification , 2016, ECCV.

[4] Hongtao Lu,et al. Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Xinggang Wang,et al. Learning generalizable deep feature using triplet-batch-center loss for person re-identification , 2020, Science China Information Sciences.

[6] Qi Tian,et al. Beyond Part Models: Person Retrieval with Refined Part Pooling , 2017, ECCV.

[7] Wei-Shi Zheng,et al. Spatial-Temporal Graph Convolutional Network for Video-Based Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Ramakant Nevatia,et al. Revisiting Temporal Modeling for Video-based Person ReID , 2018, ArXiv.

[9] Yi Yang,et al. Random Erasing Data Augmentation , 2017, AAAI.

[10] Bohyung Han,et al. Stochastic Class-Based Hard Example Mining for Deep Metric Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Xiaogang Wang,et al. Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12] David A. McAllester,et al. Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] Yunchao Wei,et al. STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification , 2018, AAAI.

[14] Jesús Martínez del Rincón,et al. Recurrent Convolutional Network for Video-Based Person Re-identification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Luca Antiga,et al. Automatic differentiation in PyTorch , 2017 .

[16] Tao Mei,et al. Part-Aligned Bilinear Representations for Person Re-identification , 2018, ECCV.

[17] Mehrtash Harandi,et al. Channel Recurrent Attention Networks for Video Pedestrian Retrieval , 2020, ACCV.

[18] Horst Bischof,et al. Person Re-identification by Descriptive and Discriminative Classification , 2011, SCIA.

[19] Xiaogang Wang,et al. Video Person Re-identification with Competitive Snippet-Similarity Aggregation and Co-attentive Snippet Embedding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[21] Shiguang Shan,et al. VRSTC: Occlusion-Free Video Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Yu Liu,et al. Quality Aware Network for Set to Set Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Leonidas J. Guibas,et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Shaogang Gong,et al. Person Re-Identification by Discriminative Selection in Video Ranking , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25] Cheng Wang,et al. Mancs: A Multi-task Attentional Network with Curriculum Sampling for Person Re-Identification , 2018, ECCV.

[26] Lars Petersson,et al. Bilinear Attention Networks for Person Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27] Alexander J. Smola,et al. Deep Sets , 2017, 1703.06114.

[28] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29] Lucas Beyer,et al. In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[30] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .

[31] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[32] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] James Philbin,et al. FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Edward J. Delp,et al. Locating Objects Without Bounding Boxes , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Wenjun Zeng,et al. Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Richard Nock,et al. Siamese Networks: The Tale of Two Manifolds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37] Xiaogang Wang,et al. SCAN: Self-and-Collaborative Attention Network for Video Person Re-Identification , 2018, IEEE Transactions on Image Processing.

[38] Afshin Dehghan,et al. GMMCP tracker: Globally optimal Generalized Maximum Multi Clique problem for multiple object tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Bingbing Ni,et al. Person Re-identification via Recurrent Feature Aggregation , 2016, ECCV.

[40] Enhua Wu,et al. Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41] Houqiang Li,et al. Spatial and Temporal Mutual Promotion for Video-based Person Re-identification , 2018, AAAI.

[42] Anurag Mittal,et al. Co-Segmentation Inspired Attention Networks for Video-Based Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).