MVOR: A Multi-view RGB-D Operating Room Dataset for 2D and 3D Human Pose Estimation

Person detection and pose estimation are key requirements for developing intelligent context-aware assistance systems. To foster the development of human pose estimation methods and their applications in the Operating Room (OR), we release the Multi-View Operating Room (MVOR) dataset, the first public dataset recorded during real clinical interventions. It consists of 732 synchronized multi-view frames recorded by three RGB-D cameras in a hybrid OR. The data exhibits the visual challenges typical of such environments, such as occlusions and clutter. We provide camera calibration parameters, color and depth frames, human bounding boxes, and 2D/3D pose annotations. In this paper, we present the dataset and its annotations, as well as baseline results from several recent person detection and 2D/3D pose estimation methods. Since parts of the released images had to be blurred to hide identity and nudity, we also present a comparative study of how the blurring affects the baselines. Results show a large margin for improvement and suggest that the MVOR dataset can be useful for comparing the performance of different methods.
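As an illustration of how calibration parameters like those released with the dataset can be used, the following sketch projects a 3D joint into a camera's image plane with a standard pinhole model. This is generic illustrative code, not MVOR-specific tooling; the calibration values shown are hypothetical placeholders.

```python
import numpy as np

def project_to_image(X_world, K, R, t):
    """Project a 3D point (world coordinates) into pixel coordinates
    using the standard pinhole model: x ~ K [R | t] X."""
    X_cam = R @ X_world + t   # world -> camera coordinates
    x = K @ X_cam             # camera -> homogeneous pixel coordinates
    return x[:2] / x[2]       # perspective divide

# Hypothetical intrinsics/extrinsics for illustration only.
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                 # camera aligned with world axes
t = np.zeros(3)

# A point 2 m in front of the camera on the optical axis
# projects onto the principal point (320, 240).
pixel = project_to_image(np.array([0.0, 0.0, 2.0]), K, R, t)
print(pixel)
```

With per-camera parameters of this kind, 3D pose annotations can be reprojected into each view to check consistency across the three cameras.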
