Semantically Synchronizing Multiple-Camera Systems with Human Pose Estimation

Multiple-camera systems can expand coverage and mitigate occlusion problems. However, temporal synchronization remains a problem for budget cameras and capture devices. We propose an out-of-the-box framework that temporally synchronizes multiple cameras using semantic human pose estimates from the videos. Human pose predictions are obtained for each camera with an off-the-shelf pose estimator. First, our method calibrates each pair of cameras by minimizing an energy function related to epipolar distances. We also propose a simple yet effective algorithm for associating multiple people across cameras, and a score-regularized energy function for improved performance. Second, we integrate the synchronized camera pairs into a graph and derive the optimal temporal displacement configuration for the multiple-camera system. We evaluate our method on four public benchmark datasets and demonstrate robust sub-frame synchronization accuracy on all of them.
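The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions not stated in the abstract: the fundamental matrix `F` of each camera pair is known, a single person has already been associated across views, the "score-regularized" energy is approximated by a confidence-weighted mean of symmetric epipolar distances, the search is over integer frame shifts only (no sub-frame refinement), and the graph stage is approximated by a minimum-energy spanning tree. The function names `epipolar_distance`, `estimate_pair_offset`, and `propagate_offsets` are hypothetical.

```python
import numpy as np

def epipolar_distance(F, x1, x2, eps=1e-12):
    """Symmetric point-to-epipolar-line distance for matched pixels.
    F is (3, 3) with x2^T F x1 = 0; x1 and x2 are (N, 2) arrays."""
    h1 = np.hstack([x1, np.ones((len(x1), 1))])
    h2 = np.hstack([x2, np.ones((len(x2), 1))])
    l2 = h1 @ F.T          # rows are F x1: epipolar lines in image 2
    l1 = h2 @ F            # rows are F^T x2: epipolar lines in image 1
    num = np.abs(np.sum(h2 * l2, axis=1))
    d2 = num / (np.linalg.norm(l2[:, :2], axis=1) + eps)
    d1 = num / (np.linalg.norm(l1[:, :2], axis=1) + eps)
    return d1 + d2

def estimate_pair_offset(F, kp1, kp2, s1, s2, max_shift):
    """Grid search for the integer shift d such that frame t of camera 1
    matches frame t + d of camera 2.
    kp1, kp2: (T, J, 2) keypoints; s1, s2: (T, J) confidence scores.
    Assumption: the energy is a confidence-weighted mean distance."""
    T = min(len(kp1), len(kp2))
    best_d, best_e = 0, np.inf
    for d in range(-max_shift, max_shift + 1):
        t1 = np.arange(max(0, -d), min(T, T - d))  # frames valid in both views
        if len(t1) == 0:
            continue
        a = kp1[t1].reshape(-1, 2)
        b = kp2[t1 + d].reshape(-1, 2)
        w = (s1[t1] * s2[t1 + d]).reshape(-1)
        e = np.sum(w * epipolar_distance(F, a, b)) / max(np.sum(w), 1e-9)
        if e < best_e:
            best_d, best_e = d, e
    return best_d, best_e

def propagate_offsets(n_cams, pair_results):
    """pair_results maps (i, j) to (d_ij, energy), where frame t of camera i
    matches frame t + d_ij of camera j. Keeps the lowest-energy edges that
    form a spanning tree (Kruskal with union-find) and accumulates offsets
    relative to camera 0. Assumption: a spanning tree stands in for the
    paper's graph-based configuration step."""
    parent = list(range(n_cams))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    adj = {i: [] for i in range(n_cams)}
    for (i, j), (d, _) in sorted(pair_results.items(), key=lambda kv: kv[1][1]):
        ri, rj = find(i), find(j)
        if ri != rj:               # edge joins two components: keep it
            parent[ri] = rj
            adj[i].append((j, d))
            adj[j].append((i, -d))

    offsets, stack = {0: 0}, [0]
    while stack:                   # depth-first walk over the tree
        i = stack.pop()
        for j, d in adj[i]:
            if j not in offsets:
                offsets[j] = offsets[i] + d
                stack.append(j)
    return offsets
```

In this sketch, a pair whose energy is high (for example, two cameras with little shared view of the subject) is simply left out of the tree, so its unreliable offset does not affect the final configuration as long as both cameras are connected through better pairs.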
