CAPformer: Pedestrian Crossing Action Prediction Using Transformer

Anticipating pedestrian crossing behavior in urban scenarios is a challenging task for autonomous vehicles. Earlier this year, a benchmark comprising the JAAD and PIE datasets was released, in which several state-of-the-art methods are ranked. However, most of the ranked temporal models rely on recurrent architectures. We propose, to the best of our knowledge, the first self-attention alternative, based on the Transformer architecture, which has had enormous success in natural language processing (NLP) and, more recently, in computer vision. Our architecture is composed of several branches that fuse video and kinematic data. The video branch is based on one of two possible architectures: RubiksNet or TimeSformer. The kinematic branch is based on different configurations of the Transformer encoder. Several experiments have been performed, focusing mainly on input data pre-processing and highlighting problems with two kinematic data sources: pose keypoints and ego-vehicle speed. Our proposed model achieves results comparable to PCPA, the best-performing model in the benchmark, reaching an F1 score of nearly 0.78 against 0.77. Furthermore, using only bounding box coordinates and image data, our model surpasses PCPA by a larger margin (F1=0.75 vs. F1=0.72). Our model has proven to be a valid alternative to recurrent architectures, providing advantages such as parallelization and whole-sequence processing, and learning relationships between samples that recurrent architectures cannot capture.
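To make the branch-and-fusion idea concrete, the following is a minimal PyTorch sketch of a kinematic branch built from a Transformer encoder, fused with a precomputed video embedding (which the paper obtains from RubiksNet or TimeSformer; not reproduced here). All module names, dimensions, the learned positional embedding, the mean pooling, and the concatenation-based fusion are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class KinematicBranch(nn.Module):
    """Illustrative Transformer-encoder branch over per-frame kinematic
    vectors (e.g. bounding-box coordinates); hyperparameters are assumptions."""

    def __init__(self, in_dim=4, d_model=128, n_heads=4, n_layers=2, seq_len=16):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)              # embed each frame's features
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))  # learned positional embedding
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        # x: (batch, seq_len, in_dim) -> pooled sequence embedding (batch, d_model)
        h = self.encoder(self.proj(x) + self.pos)
        return h.mean(dim=1)


class CrossingPredictor(nn.Module):
    """Fuses a video embedding (from a video backbone, not shown) with the
    kinematic embedding and predicts crossing vs. not crossing."""

    def __init__(self, video_dim=512, d_model=128):
        super().__init__()
        self.kin = KinematicBranch(d_model=d_model)
        self.head = nn.Sequential(
            nn.Linear(video_dim + d_model, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, video_feat, boxes):
        fused = torch.cat([video_feat, self.kin(boxes)], dim=-1)
        return self.head(fused)  # logit; apply sigmoid for crossing probability


# Toy usage: a batch of 8 observation windows, 16 frames each.
model = CrossingPredictor()
logits = model(torch.randn(8, 512), torch.randn(8, 16, 4))
```

Because the encoder attends over the whole observation window at once, the sequence is processed in parallel rather than step by step, which is the practical advantage over recurrent models noted in the abstract.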
