3-D PersonVLAD: Learning Deep Global Representations for Video-Based Person Reidentification

We present a global deep video representation learning approach for video-based person reidentification (re-ID) that aggregates local 3-D features across the entire extent of a video. Existing methods typically extract frame-wise deep features with 2-D convolutional networks (ConvNets) and pool them temporally to produce video-level representations. However, 2-D ConvNets lose temporal priors immediately after the convolutions, and a separate temporal pooling step has limited capacity to capture human motion in short sequences. In this paper, we introduce global video representation learning as a novel layer that complements 3-D ConvNets and captures appearance and motion dynamics in full-length videos. Encoding each video frame in its entirety and aggregating global representations across all frames is nevertheless very challenging because of occlusions and misalignments. To resolve this, the proposed network is further augmented with 3-D part alignment, which learns local features through a soft-attention module. These attended features are statistically aggregated to yield identity-discriminative representations. Our global 3-D features achieve state-of-the-art results on three benchmark data sets: MARS, iLIDS-VID, and PRID2011.
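The aggregation idea described above can be illustrated with a minimal NumPy sketch of VLAD-style pooling with soft assignment, applied to local descriptors taken from every spatio-temporal position of a video and weighted by soft attention. This is not the authors' implementation: the function name `attended_vlad`, the distance-based soft assignment, the `alpha` sharpness parameter, and the random inputs standing in for 3-D ConvNet features, cluster centers, and attention scores are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attended_vlad(features, centers, attention_logits, alpha=1.0):
    """VLAD-style aggregation of local descriptors (illustrative sketch).

    features:         (N, D) local descriptors, e.g. all T*H*W positions
                      of a 3-D feature map flattened into one list.
    centers:          (K, D) cluster centers (learned in a real network;
                      supplied by the caller here).
    attention_logits: (N,) unnormalized attention score per descriptor
                      (hypothetical stand-in for a soft-attention module).
    Returns a (K*D,) L2-normalized video-level descriptor.
    """
    # Residual of every descriptor to every center: (N, K, D).
    residuals = features[:, None, :] - centers[None, :, :]

    # Soft assignment of descriptors to centers from squared distances: (N, K).
    sq_dist = (residuals ** 2).sum(axis=-1)
    assign = softmax(-alpha * sq_dist, axis=1)

    # Soft attention over all descriptors of the video: (N,), sums to 1.
    attn = softmax(attention_logits, axis=0)

    # Attention-weighted, assignment-weighted sum of residuals: (K, D).
    weights = assign * attn[:, None]
    vlad = (weights[:, :, None] * residuals).sum(axis=0)

    # Intra-normalization per center, then global L2 normalization.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    vlad = vlad.reshape(-1)
    return vlad / (np.linalg.norm(vlad) + 1e-12)


# Usage with random data standing in for real 3-D ConvNet features.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 3, 2, 8))      # (T, H, W, D)
local_feats = feature_map.reshape(-1, 8)         # (T*H*W, D)
centers = rng.normal(size=(5, 8))                # K = 5 centers
logits = rng.normal(size=local_feats.shape[0])   # one score per position
video_descriptor = attended_vlad(local_feats, centers, logits)
```

Because the sum runs over all positions of all frames at once, the output length depends only on the number of centers and the feature dimension, not on the video length, which is what allows a single fixed-size descriptor for full-length videos.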
