Epipolar Transformer for Multi-view Human Pose Estimation

A common way to localize 3D human joints in a synchronized and calibrated multi-view setup is a two-step process: (1) apply a 2D detector separately on each view to localize joints in 2D, (2) robust triangulation on 2D detections from each view to acquire the 3D joint locations. However, in step 1, the 2D detector is constrained to solve challenging cases which could be better resolved in 3D, such as occlusions and oblique viewing angles, purely in 2D without leveraging any 3D information. Therefore, we propose the differentiable "epipolar transformer", which empowers the 2D detector to leverage 3D-aware features to improve 2D pose estimation. The intuition is: given a 2D location p in the reference view, we would like to first find its corresponding point p' in the source view, then combine the features at p' with the features at p, thus leading to a more 3D-aware feature at p. Inspired by stereo matching, the epipolar transformer leverages epipolar constraints and feature matching to approximate the features at p′. The key advantages of the epipolar transformer are: (1) it has minimal learnable parameters, (2) itcanbeeasilypluggedintoexistingnetworks,moreover (3) it is interpretable, i.e., we can analyze the location p' to understand whether matching over the epipolar line was successful. Experiments on Human3.6M [9] show that our approach has consistent improvements over the baselines. Specifically, in the condition where no external data is used, our Human3.6M model trained with ResNet-50 and image size 256 x 256 outperforms state-of-the-art by a large margin and achieves MPJPE 26.9 mm. Code is available1. This is the workshop version ofour CVPR 2020 paper [8]

[1]  Victor Lempitsky,et al.  Learnable Triangulation of Human Pose , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Wenjun Zeng,et al.  Cross View Fusion for 3D Human Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[6]  Jianfei Cai,et al.  3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material) , 2019 .

[7]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Cristian Sminchisescu,et al.  Latent structured models for human pose estimation , 2011, 2011 International Conference on Computer Vision.

[9]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[12]  Nicolas Padoy,et al.  A generalizable approach for multi-view 3D human pose regression , 2018, Machine Vision and Applications.

[13]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Xiaowei Zhou,et al.  Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[17]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Qiang Li,et al.  End-to-End Hand Mesh Recovery From a Monocular RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Yuan Yao,et al.  MONET: Multiview Semi-supervised Keypoint via Epipolar Divergence , 2018, ArXiv.

[20]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[21]  Lourdes Agapito,et al.  Rethinking Pose in 3D: Multi-stage Refinement and Recovery for Markerless Motion Capture , 2018, 2018 International Conference on 3D Vision (3DV).