Temporal-Contextual Attention Network for Video-Based Person Re-identification

Video-based person re-identification aims to identify a specific person in surveillance videos captured by different cameras. This paper presents a Temporal-Contextual Attention Network (TCA-Net) for this task. TCA-Net exploits the temporally local context among consecutive frames to focus selectively on the crucial frames of a video sequence. Specifically, the network consists of a Convolutional Neural Network (CNN) module and a temporal-contextual attention block. The CNN module embeds each video frame into a convolutional representation, and the temporal-contextual attention block learns the importance of each frame for re-identification by exploiting the local context between that frame and its neighboring frames. The feature of a video sequence is then obtained by aggregating the frame-level features weighted by frame importance. We evaluate the proposed TCA-Net on the challenging MARS dataset, and the experimental results demonstrate the effectiveness of the proposed approach.
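The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' architecture: the abstract does not specify how frame importance is scored, so the linear scoring over a concatenated window, the softmax normalization, the clamped edge handling, and all names (`local_context`, `temporal_contextual_pool`, `weights`, `radius`) are assumptions made for illustration. Frame features stand in for the CNN embeddings.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def local_context(features: Sequence[Vector], t: int, radius: int) -> Vector:
    """Concatenate the features of frame t and its neighbours.

    Indices outside the sequence are clamped to the first or last frame
    (an assumed padding choice).
    """
    last = len(features) - 1
    context: Vector = []
    for offset in range(-radius, radius + 1):
        idx = min(max(t + offset, 0), last)
        context.extend(features[idx])
    return context


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_contextual_pool(
    features: Sequence[Vector],
    weights: Sequence[float],
    bias: float = 0.0,
    radius: int = 1,
) -> Tuple[List[float], Vector]:
    """Score each frame from its local window, then pool the sequence.

    Returns (frame importance, sequence-level feature).
    """
    dim = len(features[0])
    if len(weights) != (2 * radius + 1) * dim:
        raise ValueError("weights must cover the whole local window")

    # Hypothetical scoring function: a single linear layer on the window.
    scores = [
        sum(w * x for w, x in zip(weights, local_context(features, t, radius))) + bias
        for t in range(len(features))
    ]
    importance = softmax(scores)

    # Sequence feature: frame features weighted by their importance.
    pooled = [
        sum(a * frame[d] for a, frame in zip(importance, features))
        for d in range(dim)
    ]
    return importance, pooled
```

With all-zero `weights`, every frame receives the same importance and the pooled feature reduces to plain temporal average pooling. In a trained model the scoring parameters would be learned jointly with the CNN.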
