Multi-Scale Temporal Cues Learning for Video Person Re-Identification

Temporal cues embedded in videos provide important clues for person Re-Identification (ReID). To efficiently exploit temporal cues with a compact neural network, this work proposes a novel 3D convolution layer called Multi-scale 3D (M3D) convolution layer. The M3D layer is easy to implement and could be inserted into traditional 2D convolution networks to learn multi-scale temporal cues by end-to-end training. According to its inserted location, the M3D layer has two variants, i.e., local M3D layer and global M3D layer, respectively. The local M3D layer is inserted between 2D convolution layers to learn spatial-temporal cues among adjacent 2D feature maps. The global M3D layer is computed on adjacent frame feature vectors to learn their global temporal relations. The local and global M3D layers hence learn complementary temporal cues. Their combination introduces a fraction of parameters to traditional 2D CNN, but leads to the strong multi-scale temporal feature learning capability. The learned temporal feature is fused with a spatial feature to compose the final spatial-temporal representation for video person ReID. Evaluations on four widely used video person ReID datasets, i.e., MARS, DukeMTMC-VideoReID, PRID2011, and iLIDS-VID demonstrate the substantial advantages of our method over the state-of-the art. For example, it achieves rank1 accuracy of 88.63% on MARS without re-ranking. Our method also achieves a reasonable trade-off between ReID accuracy and model size, e.g., it saves about 40% parameters of I3D CNN.

[1]  Yunchao Wei,et al.  STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification , 2018, AAAI.

[2]  Longhui Wei,et al.  Person Transfer GAN to Bridge Domain Gap for Person Re-identification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Jiwen Lu,et al.  Localized multi-kernel discriminative canonical correlation analysis for video-based person re-identification , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Hantao Yao,et al.  Deep Representation Learning With Part Loss for Person Re-Identification , 2017, IEEE Transactions on Image Processing.

[7]  Larry S. Davis,et al.  Multi-Task Learning with Low Rank Attribute Embedding for Person Re-Identification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Lin Wu,et al.  Deep Recurrent Convolutional Networks for Video-based Person Re-identification: An End-to-End Approach , 2016, ArXiv.

[9]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[10]  Nico Karssemeijer,et al.  Deep multi-scale location-aware 3D convolutional neural networks for automated detection of lacunes of presumed vascular origin , 2016, NeuroImage: Clinical.

[11]  Q. Tian,et al.  GLAD: Global-Local-Alignment Descriptor for Pedestrian Retrieval , 2017, ACM Multimedia.

[12]  Jesús Martínez del Rincón,et al.  Recurrent Convolutional Network for Video-Based Person Re-identification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Liang Zheng,et al.  Re-ranking Person Re-identification with k-Reciprocal Encoding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Fei Xiong,et al.  Person Re-Identification Using Kernel-Based Metric Learning Methods , 2014, ECCV.

[15]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Francesco Solera,et al.  Performance Measures and a Data Set for Multi-target, Multi-camera Tracking , 2016, ECCV Workshops.

[17]  Xiaoyu Wang,et al.  Group-Group Loss-Based Global-Regional Feature Learning for Vehicle Re-Identification , 2019, IEEE Transactions on Image Processing.

[18]  Shiliang Zhang,et al.  Pose-Driven Deep Convolutional Model for Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[21]  Sergio A. Velastin,et al.  Local Fisher Discriminant Analysis for Pedestrian Re-identification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Jingdong Wang,et al.  OCNet: Object Context Network for Scene Parsing , 2018, ArXiv.

[24]  Jiwen Lu,et al.  Spatial-Temporal Attention-Aware Learning for Video-Based Person Re-Identification , 2019, IEEE Transactions on Image Processing.

[25]  Shiliang Zhang,et al.  Resolution-invariant Person Re-Identification , 2019, IJCAI.

[26]  Tao Xiang,et al.  Multi-scale Deep Learning Architectures for Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Bingbing Ni,et al.  Deep Cross-Modality Alignment for Multi-Shot Person Re-IDentification , 2017, ACM Multimedia.

[28]  Yi Yang,et al.  Random Erasing Data Augmentation , 2017, AAAI.

[29]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Marcello Pelillo,et al.  Multi-target Tracking in Multiple Non-overlapping Cameras Using Fast-Constrained Dominant Sets , 2019, International Journal of Computer Vision.

[31]  Jiwen Lu,et al.  Learning Discriminative Aggregation Network for Video-Based Face Recognition and Person Re-identification , 2018, International Journal of Computer Vision.

[32]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Tsukasa Ogasawara,et al.  Temporal-Enhanced Convolutional Network for Person Re-Identification , 2018, AAAI.

[35]  Shuicheng Yan,et al.  Video-Based Person Re-Identification With Accumulative Motion Context , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[36]  Zhi Zhang,et al.  Supervised Deep Feature Embedding With Handcrafted Feature , 2019, IEEE Transactions on Image Processing.

[37]  Bingbing Ni,et al.  Person Re-identification via Recurrent Feature Aggregation , 2016, ECCV.

[38]  Shiguang Shan,et al.  VRSTC: Occlusion-Free Video Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Gang Wang,et al.  Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Yun Fu,et al.  Discriminative Semi-Coupled Projective Dictionary Learning for Low-Resolution Person Re-Identification , 2018, AAAI.

[42]  Horst Bischof,et al.  Large scale metric learning from equivalence constraints , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Yun Fu,et al.  Support Neighbor Loss for Person Re-Identification , 2018, ACM Multimedia.

[44]  Konstantinos Kamnitsas,et al.  Multi-scale 3D convolutional neural networks for lesion segmentation in brain MRI , 2015 .

[45]  Afshin Dehghan,et al.  GMMCP tracker: Globally optimal Generalized Maximum Multi Clique problem for multiple object tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Qi Tian,et al.  MARS: A Video Benchmark for Large-Scale Person Re-Identification , 2016, ECCV.

[47]  Yongdong Zhang,et al.  Dense 3D-Convolutional Neural Network for Person Re-Identification in Videos , 2019, ACM Trans. Multim. Comput. Commun. Appl..

[48]  Tiejun Huang,et al.  Multi-scale 3D Convolution Network for Video Based Person Re-Identification , 2018, AAAI.

[49]  Xiaogang Wang,et al.  Video Person Re-identification with Competitive Snippet-Similarity Aggregation and Co-attentive Snippet Embedding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[51]  Shaogang Gong,et al.  Harmonious Attention Network for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Xiaogang Wang,et al.  Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Kan Liu,et al.  Learning Compact Appearance Representation for Video-Based Person Re-Identification , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[54]  Yu Cheng,et al.  Jointly Attentive Spatial-Temporal Pooling Networks for Video-Based Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  Jun Fu,et al.  Dual Attention Network for Scene Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Yu Liu,et al.  Quality Aware Network for Set to Set Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[58]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[59]  Kun Yu,et al.  DenseASPP for Semantic Segmentation in Street Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60]  Edward J. Delp,et al.  A Two Stream Siamese Convolutional Neural Network for Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[61]  Qi Tian,et al.  Pose-Guided Representation Learning for Person Re-Identification , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[63]  Liang Wang,et al.  Mask-Guided Contrastive Attention Model for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64]  Yang Li,et al.  Person Re-Identification with Discriminatively Trained Viewpoint Invariant Dictionaries , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[65]  Shaogang Gong,et al.  Unsupervised Person Re-identification by Deep Learning Tracklet Association , 2018, ECCV.

[66]  Shiliang Zhang,et al.  RAM: A Region-Aware Deep Model for Vehicle Re-Identification , 2018, 2018 IEEE International Conference on Multimedia and Expo (ICME).

[67]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[68]  Zhen Zhou,et al.  See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-Based Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Tao Mei,et al.  Part-Aligned Bilinear Representations for Person Re-identification , 2018, ECCV.

[70]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[71]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[72]  Guisong Liu,et al.  Self Attention Grid for Person Re-Identification , 2018, ArXiv.

[73]  Houqiang Li,et al.  Spatial and Temporal Mutual Promotion for Video-based Person Re-identification , 2018, AAAI.

[74]  Jiwen Lu,et al.  Consistent-Aware Deep Learning for Person Re-identification in a Camera Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Yu Wu,et al.  Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[76]  Wei-Shi Zheng,et al.  Unsupervised Person Re-Identification by Deep Asymmetric Metric Embedding , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[77]  Xiaogang Wang,et al.  Person Re-identification: System Design and Evaluation Overview , 2014, Person Re-Identification.

[78]  Ming Shao,et al.  Person Re-Identification by Cross-View Multi-Level Dictionary Learning , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[79]  Hao Chen,et al.  3D multi‐scale FCN with random modality voxel dropout learning for Intervertebral Disc Localization and Segmentation from Multi‐modality MR Images , 2018, Medical Image Anal..

[80]  Kaiqi Huang,et al.  Learning Deep Context-Aware Features over Body and Latent Parts for Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[81]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[82]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Horst Bischof,et al.  Person Re-identification by Descriptive and Discriminative Classification , 2011, SCIA.

[84]  Yunchao Wei,et al.  CCNet: Criss-Cross Attention for Semantic Segmentation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[85]  Bingbing Ni,et al.  Multi-level attention model for person re-identification , 2019, Pattern Recognit. Lett..

[86]  Qi Wang,et al.  Self-attention Learning for Person Re-identification , 2018, BMVC.

[87]  Hongtao Lu,et al.  Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[88]  Qi Tian,et al.  Video-Based Person Re-identification by Deep Feature Guided Pooling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[89]  Qi Tian,et al.  Scalable Person Re-identification: A Benchmark , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).