Spatio-Temporal Representation Factorization for Video-based Person Re-Identification

Despite much recent progress in video-based person re-identification (re-ID), the current state-of-the-art still suffers from common real-world challenges such as appearance similarity among various people, occlusions, and frame misalignment. To alleviate these problems, we propose Spatio-Temporal Representation Factorization (STRF), a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID. The key innovations of STRF over prior work include explicit pathways for learning discriminative temporal and spatial features, with each component further factorized to capture complementary person-specific appearance and motion information. Specifically, temporal factorization comprises two branches, one each for static features (e.g., the color of clothes) that do not change much over time, and dynamic features (e.g., walking patterns) that change over time. Further, spatial factorization also comprises two branches to learn both global (coarse segments) as well as local (finer segments) appearance features, with the local features particularly useful in cases of occlusion or spatial misalignment. These two factorization operations taken together result in a modular architecture for our parameter-wise light STRF unit that can be plugged in between any two 3D convolutional layers, resulting in an end-to-end learning framework. We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results using standard person re-ID evaluation protocols on three benchmarks.

[1]  Shaogang Gong,et al.  Person Re-identification by Video Ranking , 2014, ECCV.

[2]  Hongtao Lu,et al.  Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Shengcai Liao,et al.  Person re-identification by Local Maximal Occurrence representation and metric learning , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Yasushi Makihara,et al.  Gait Recognition by Deformable Registration , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[7]  Cuiling Lan,et al.  Style Normalization and Restitution for Generalizable Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Qi Tian,et al.  MARS: A Video Benchmark for Large-Scale Person Re-Identification , 2016, ECCV.

[9]  Shiguang Shan,et al.  VRSTC: Occlusion-Free Video Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[12]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Han Xu,et al.  Temporal Aggregation with Clip-level Attention for Video-based Person Re-identification , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[14]  Jie Zhou,et al.  Temporal Coherence or Temporal Motion: Which Is More Critical for Video-Based Person Re-identification? , 2020, ECCV.

[15]  Amir Erfan Eshratifar,et al.  Video Person Re-ID: Fantastic Techniques and Where to Find Them , 2020, AAAI.

[16]  Rongrong Ji,et al.  Salience-Guided Cascaded Suppression Network for Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Lingxiao He,et al.  Video-based Person Re-identification via 3D Convolutional Networks and Non-local Attention , 2018, ACCV.

[19]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Xiaogang Wang,et al.  Video Person Re-identification with Competitive Snippet-Similarity Aggregation and Co-attentive Snippet Embedding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Amit K. Roy-Chowdhury,et al.  Non-Adversarial Video Synthesis with Learned Priors , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Yu Cheng,et al.  Jointly Attentive Spatial-Temporal Pooling Networks for Video-Based Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Amit K. Roy-Chowdhury,et al.  Ada-VSR: Adaptive Video Super-Resolution with Meta-Learning , 2021, ACM Multimedia.

[25]  Anurag Mittal,et al.  Co-Segmentation Inspired Attention Networks for Video-Based Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Francesco Solera,et al.  Performance Measures and a Data Set for Multi-target, Multi-camera Tracking , 2016, ECCV Workshops.

[27]  Jan Kautz,et al.  MoCoGAN: Decomposing Motion and Content for Video Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Ling Shao,et al.  Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Yu Wu,et al.  Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Ziyan Wu,et al.  From the Lab to the Real World: Re-identification in an Airport Camera Network , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[31]  Cuiling Lan,et al.  Relation-Aware Global Attention for Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ramakant Nevatia,et al.  Revisiting Temporal Modeling for Video-based Person ReID , 2018, ArXiv.

[33]  Nikos Komodakis,et al.  Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer , 2016, ICLR.

[34]  Zhen Zhou,et al.  See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-Based Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Amit K. Roy-Chowdhury,et al.  ALANET: Adaptive Latent Attention Network for Joint Video Deblurring and Interpolation , 2020, ACM Multimedia.

[36]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Tiejun Huang,et al.  Multi-Scale Temporal Cues Learning for Video Person Re-Identification , 2019, IEEE Transactions on Image Processing.

[38]  Yi Yang,et al.  Random Erasing Data Augmentation , 2017, AAAI.

[39]  Xiaogang Wang,et al.  Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[41]  Wenjun Zeng,et al.  Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Qi Tian,et al.  Scalable Person Re-identification: A Benchmark , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Afshin Dehghan,et al.  GMMCP tracker: Globally optimal Generalized Maximum Multi Clique problem for multiple object tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Xilin Chen,et al.  Temporal Complementary Learning for Video Person Re-Identification , 2020, ECCV.

[46]  Richard I. Hartley,et al.  Person Reidentification Using Spatiotemporal Appearance , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[47]  Hai Tao,et al.  Evaluating Appearance Models for Recognition, Reacquisition, and Tracking , 2007 .

[48]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[51]  Yun Fu,et al.  Image Super-Resolution Using Very Deep Residual Channel Attention Networks , 2018, ECCV.

[52]  Jan Flusser,et al.  Image registration methods: a survey , 2003, Image Vis. Comput..

[53]  Shu-Tao Xia,et al.  Second-Order Attention Network for Single Image Super-Resolution , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Boqiang Liu,et al.  S3D-UNet: Separable 3D U-Net for Brain Tumor Segmentation , 2018, BrainLes@MICCAI.

[55]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[56]  Ziyan Wu,et al.  A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Xilin Chen,et al.  Appearance-Preserving 3D Convolution for Video-based Person Re-identification , 2020, ECCV.

[58]  Ziyan Wu,et al.  Re-Identification With Consistent Attentive Siamese Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Shaogang Gong,et al.  Harmonious Attention Network for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60]  Qi Tian,et al.  Beyond Part Models: Person Retrieval with Refined Part Pooling , 2017, ECCV.

[61]  Wei-Shi Zheng,et al.  Spatial-Temporal Graph Convolutional Network for Video-Based Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).