VRU Pose-SSD: Multiperson Pose Estimation For Automated Driving

We present a fast and efficient approach for joint person detection and pose estimation, optimized for automated driving (AD) in urban scenarios. We use a multitask weight-sharing architecture to train detection and pose estimation jointly. This modular architecture allows us to accommodate different downstream tasks in the future. Through systematic large-scale experiments on the Tsinghua-Daimler Urban Pose Dataset (TDUP), we obtain multiple models with varying accuracy-speed trade-offs. We then quantize and optimize our network for deployment and present a detailed analysis of the efficacy of the algorithm. We introduce a two-stage evaluation strategy, which is more suitable for AD and achieves a significant performance improvement in comparison to state-of-the-art approaches. Our optimized model runs at 52 fps on full HD images and still reaches a competitive performance of 32.25 LAMR (log-average miss rate). We are confident that our work serves as an enabler for higher-level tasks such as vulnerable road user (VRU) intention estimation and gesture recognition, which rely on stable pose estimates and will play a crucial role in future AD systems.
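For readers unfamiliar with the LAMR figure quoted above, the following is a minimal sketch of how a log-average miss rate is commonly computed in pedestrian detection benchmarks: the miss rate is sampled at nine false-positives-per-image (FPPI) reference points spaced evenly in log space between 10^-2 and 10^0, and the samples are averaged geometrically. The function name, the input format, and the choice to count a miss rate of 1.0 where the curve has no point at or below a reference FPPI are assumptions for illustration; the abstract does not specify the exact evaluation protocol, and this is not the authors' evaluation code.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI references.

    fppi      -- false positives per image, sorted in ascending order
    miss_rate -- miss rate at each corresponding FPPI value
    Both inputs are hypothetical curve samples produced by sweeping the
    detector's confidence threshold.
    """
    # Reference points evenly spaced in log space over [1e-2, 1e0].
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]

    samples = []
    for ref in refs:
        # Assumption: use the miss rate at the largest FPPI not exceeding
        # the reference; if the curve has no such point, count it as 1.0.
        mr = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        samples.append(max(mr, 1e-10))

    # Averaging in the log domain is equivalent to a geometric mean.
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


# Example with a made-up curve (values are illustrative, not from the paper).
example_fppi = [0.005, 0.05, 0.5, 1.0]
example_mr = [0.60, 0.40, 0.25, 0.20]
lamr = log_average_miss_rate(example_fppi, example_mr)
print(f"LAMR: {100 * lamr:.2f}%")
```

Because the average is taken in the log domain, the metric weights the low-FPPI region as heavily as the high-FPPI region, which is one reason it is often preferred for safety-relevant detection tasks.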
