MRG-T: Mask-Relation-Guided Transformer for Remote Vision-Based Pedestrian Attribute Recognition in Aerial Imagery

With the rapid development of consumer Unmanned Aerial Vehicles (UAVs), UAV platforms have become attractive for visual surveillance, and remote vision-based pedestrian attribute recognition is a key component of such systems. Pedestrian Attribute Recognition (PAR) aims to predict multiple attribute labels for a single pedestrian image extracted from surveillance videos or aerial imagery. The task remains challenging in the computer vision community because of factors such as poor imaging quality and substantial pose variations. Although recent studies have made impressive progress by using complicated architectures and exploring relations, most of them may fail to fully and systematically consider inter-region, inter-attribute, and region-attribute mapping relations simultaneously, and they may suffer from information redundancy, which degrades recognition accuracy. To address these issues, we construct a novel Mask-Relation-Guided Transformer (MRG-T) framework that consists of three relation modeling modules to fully exploit spatial and semantic relations during model learning. Specifically, we first propose a Masked Region Relation Module (MRRM) that focuses on precise spatial attention regions and extracts more robust features through masked random patch training. To explore the semantic association of attributes, we further present a Masked Attribute Relation Module (MARM) that extracts intrinsic, semantic inter-attribute relations with an attribute label masking strategy. Based on the cross-attention mechanism, we finally design a Region and Attribute Mapping Module (RAMM) to learn the cross-modal alignment between spatial regions and semantic attributes. We conduct comprehensive experiments on three public benchmarks (PETA, PA-100K, and RAPv1) and conduct inference on a large-scale airborne person dataset named PRAI-1581.
The extensive experimental results demonstrate the superior performance of our method compared to state-of-the-art approaches and validate the effectiveness of mask-relation-guided modeling in the remote vision-based PAR task.
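
The two generic building blocks named above, random patch masking and cross-attention between attribute queries and region tokens, can be illustrated with a minimal sketch. This is not the MRG-T implementation: the zero-vector masking scheme, the 25% mask ratio, the token dimensions, and the function names `mask_random_patches` and `cross_attention` are all assumptions made for illustration, and learned projections are omitted.

```python
import math
import random


def mask_random_patches(tokens, mask_ratio, rng):
    """Zero out a random subset of patch tokens (assumed masking scheme)."""
    n = len(tokens)
    n_mask = int(round(n * mask_ratio))
    masked_idx = set(rng.sample(range(n), n_mask))
    dim = len(tokens[0])
    out = [[0.0] * dim if i in masked_idx else list(tok)
           for i, tok in enumerate(tokens)]
    return out, sorted(masked_idx)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Each query (here: an attribute embedding) attends over all keys
    (here: region tokens) and returns a weighted sum of the values.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical sizes: 8 region tokens, 3 attribute queries, 4-dim embeddings.
rng = random.Random(0)
num_regions, num_attributes, dim = 8, 3, 4
region_tokens = [[rng.gauss(0, 1) for _ in range(dim)]
                 for _ in range(num_regions)]
attribute_queries = [[rng.gauss(0, 1) for _ in range(dim)]
                     for _ in range(num_attributes)]

# Mask 25% of the region tokens, then let attribute queries attend to regions.
masked_tokens, masked_idx = mask_random_patches(region_tokens, 0.25, rng)
attended, attn = cross_attention(attribute_queries, masked_tokens, masked_tokens)
```

Each row of `attn` is a distribution over regions for one attribute, which is the kind of region-attribute mapping that a cross-attention module can expose; a trained model would add learned query, key, and value projections and a classification head.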
