UrbanPose: A New Benchmark for VRU Pose Estimation in Urban Traffic Scenes

Human pose, as a robust, appearance-invariant mid-level feature, has proven effective and efficient for human action recognition and intention estimation. Pose features also have great potential to improve trajectory prediction for Vulnerable Road Users (VRUs) in ADAS and automated driving applications. However, the lack of large, highly diverse VRU pose datasets makes it difficult to transfer and apply pose estimation methods to VRUs. This paper introduces the Tsinghua-Daimler Urban Pose dataset (TDUP), a large-scale 2D VRU pose image dataset collected in Chinese urban traffic environments from on board a moving vehicle. The TDUP dataset contains 21k images with more than 90k high-quality, manually labeled VRU bounding boxes with pose keypoint annotations and additional tags. We optimize four state-of-the-art deep learning approaches (AlphaPose, Mask R-CNN, Pose-SSD and PifPaf) to serve as baselines for the new pose estimation benchmark. We further analyze the effects of large pre-training datasets, different training data proportions, and optional label information during training. We expect the new benchmark to lay the foundation for further VRU pose studies and to support the development of accurate VRU trajectory prediction methods in complex urban traffic scenes. The dataset (including an evaluation server) is available at www.urbanpose-dataset.com for non-commercial scientific use.