Person Retrieval in Video Surveillance Using Deep Learning-Based Instance Segmentation

Video surveillance systems are deployed in many places, such as airports, train stations, and malls, for security and monitoring purposes. However, searching for and retrieving persons in multicamera surveillance systems is laborious, especially when backgrounds are cluttered and a person's appearance varies across cameras. To address these problems, this paper proposes a person retrieval method that applies an instance segmentation module to each object of interest and extracts attributes from the resulting masked image. A person is described by attributes such as the color and type of clothes. The proposed person retrieval system involves four steps: (1) performing pixelwise person segmentation with the YOLACT++ model, (2) extracting appearance-based attribute features using a multiple convolutional neural network classifier, (3) searching with an engine based on a fundamental attribute matching approach, and (4) applying a video summarization technique to produce a temporal abstraction of the retrieved objects. Experimental results show that the proposed retrieval system can achieve effective retrieval performance and provide a quick overview of the retrieved content for multicamera surveillance systems.
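The attribute matching in step (3) can be illustrated with a minimal sketch. It is not the paper's implementation: the `Detection` record, the attribute names (`top_color`, `top_type`), the scoring rule (fraction of queried attributes that match exactly), and the `min_score` threshold are all assumptions made for illustration. The sketch also assumes that segmentation and attribute classification have already produced one attribute dictionary per segmented person.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One segmented person with predicted attributes (hypothetical record layout)."""
    camera_id: str
    frame_index: int
    attributes: dict  # e.g. {"top_color": "red", "top_type": "jacket"}


def match_score(query: dict, attributes: dict) -> float:
    """Fraction of queried attributes whose predicted value equals the queried value."""
    if not query:
        return 0.0
    hits = sum(1 for key, value in query.items() if attributes.get(key) == value)
    return hits / len(query)


def retrieve(query: dict, detections: list, min_score: float = 1.0) -> list:
    """Return detections scoring at least min_score, best matches first.

    Ties are broken by camera id and then frame index so the order is deterministic.
    """
    scored = [(match_score(query, d.attributes), d) for d in detections]
    kept = [(score, d) for score, d in scored if score >= min_score]
    kept.sort(key=lambda item: (-item[0], item[1].camera_id, item[1].frame_index))
    return [d for _, d in kept]


# Illustrative data only; these values are not taken from the paper.
detections = [
    Detection("cam1", 10, {"top_color": "red", "top_type": "jacket"}),
    Detection("cam2", 5, {"top_color": "red", "top_type": "shirt"}),
    Detection("cam1", 3, {"top_color": "blue", "top_type": "jacket"}),
]
query = {"top_color": "red", "top_type": "jacket"}
exact_matches = retrieve(query, detections)
partial_matches = retrieve(query, detections, min_score=0.5)
```

With the default threshold only detections that satisfy every queried attribute are returned; lowering `min_score` admits partial matches, which may be useful when attribute predictions are noisy across cameras.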
