Spatiotemporal Bundle Adjustment for Dynamic 3D Human Reconstruction in the Wild

Bundle adjustment jointly optimizes camera intrinsics, camera extrinsics, and the triangulated 3D points of a static scene. The triangulation constraint, however, does not hold for moving points captured by multiple unsynchronized videos, and bundle adjustment is not designed to estimate the temporal alignment between cameras. We present a spatiotemporal bundle adjustment framework that jointly optimizes four coupled sub-problems: estimating camera intrinsics and extrinsics, triangulating static 3D points, estimating the sub-frame temporal alignment between cameras, and computing the 3D trajectories of dynamic points. Key to our joint optimization is the careful integration of physics-based motion priors within the reconstruction pipeline, validated on a large motion capture corpus of human subjects. We devise an incremental reconstruction and alignment algorithm that strictly enforces the motion prior during spatiotemporal bundle adjustment. A divide-and-conquer scheme makes this algorithm more efficient while maintaining high accuracy. We apply the algorithm to reconstruct 3D motion trajectories of human bodies in dynamic events captured by multiple uncalibrated and unsynchronized video cameras in the wild. To make the reconstruction more interpretable, we fit a statistical 3D human body model to the asynchronous video streams. Compared to the baseline, the fitting benefits significantly from the proposed spatiotemporal bundle adjustment. Because the videos are aligned with sub-frame precision, we reconstruct 3D motion at a much higher temporal resolution than that of the input videos.
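To illustrate why a motion prior can pin down a sub-frame offset, here is a toy sketch under strong simplifying assumptions. It assumes two cameras at the same frame rate, assumes the dynamic 3D point positions are already known per frame (the real problem only observes 2D projections and also optimizes the cameras), and uses a kinetic-energy-style prior (sum of |dx|^2 / dt over time-sorted samples) as one example of a physics-based prior, not necessarily the exact formulation used here. A plain grid search stands in for the joint nonlinear optimization. All names (`FPS`, `trajectory`, `sample_camera`, `kinetic_energy`, `estimate_offset`) are hypothetical.

```python
import math

FPS = 30.0  # assumed common frame rate of both toy cameras


def trajectory(t):
    """Hypothetical ground-truth motion of one dynamic point: a constant-speed helix."""
    return (math.cos(t), math.sin(t), 0.2 * t)


def sample_camera(num_frames, offset_frames):
    """Return (time, point) pairs for a camera whose frame f is captured at
    (f + offset_frames) / FPS seconds. Points are assumed already reconstructed."""
    return [((f + offset_frames) / FPS, trajectory((f + offset_frames) / FPS))
            for f in range(num_frames)]


def kinetic_energy(samples):
    """Kinetic-energy-style prior: sum of |dx|^2 / dt between consecutive
    time-sorted samples (a discrete approximation of the integral of speed^2 dt)."""
    ordered = sorted(samples, key=lambda s: s[0])
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        dist2 = sum((a - b) ** 2 for a, b in zip(p0, p1))
        energy += dist2 / (t1 - t0)
    return energy


def estimate_offset(ref_samples, other_points, candidates):
    """Pick the sub-frame offset (in frames) of the second camera whose
    re-timestamped samples make the merged trajectory cheapest under the prior."""
    best_beta, best_cost = None, float("inf")
    for beta in candidates:
        # Timestamp the second camera's points under the candidate offset.
        retimed = [((f + beta) / FPS, p) for f, p in enumerate(other_points)]
        cost = kinetic_energy(ref_samples + retimed)
        if cost < best_cost:
            best_beta, best_cost = beta, cost
    return best_beta


if __name__ == "__main__":
    cam1 = sample_camera(60, 0.0)
    # The second camera's true offset (0.4 frames) is hidden from the estimator.
    cam2_points = [p for _, p in sample_camera(60, 0.4)]
    candidates = [k / 100 for k in range(1, 100)]  # open interval (0, 1) frames
    beta_hat = estimate_offset(cam1, cam2_points, candidates)
    print(f"estimated offset: {beta_hat:.2f} frames")
```

A wrong offset distorts the spacing between interleaved samples, and for roughly constant local speed the sum of |dx|^2 / dt is smallest when the assumed time gaps match the true ones, so the search should land close to the hidden 0.4-frame offset. Once aligned, the 120 merged, time-sorted samples trace the motion at an effective 60 Hz from two 30 Hz streams, which is the toy analogue of the higher temporal resolution claimed above.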
