EventHands: Real-Time Neural 3D Hand Pose Estimation from an Event Stream

3D hand pose estimation from monocular videos is a long-standing and challenging problem, which is now seeing a strong upturn. In this work, we address it for the first time using a single event camera, i.e., an asynchronous vision sensor reacting on brightness changes. Our EventHands approach has characteristics previously not demonstrated with a single RGB or depth camera such as high temporal resolution at low data throughputs and real-time performance at 1000 Hz. Due to the different data modality of event cameras compared to classical cameras, existing methods cannot be directly applied to and re-trained for event streams. We thus design a new neural approach which accepts a new event stream representation suitable for learning, which is trained on newly-generated synthetic event streams and can generalise to real data. Experiments show that EventHands outperforms recent monocular methods using a colour (or depth) camera in terms of accuracy and its ability to capture hand motions of unprecedented speed. Our method, the event stream simulator and the dataset are publicly available (see https://4dqv.mpi-inf.mpg.de/EventHands/).

[1]  Hans-Peter Seidel,et al.  Differentiable Event Stream Simulator for Non-Rigid 3D Tracking , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[2]  Wolfgang Heidrich,et al.  Stereo Event-Based Particle Tracking Velocimetry for 3D Fluid Flow Reconstruction , 2020, ECCV.

[3]  Shaojie Shen,et al.  Event-Based Stereo Visual Odometry , 2020, IEEE Transactions on Robotics.

[4]  Li Liu,et al.  JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image , 2020, ECCV.

[5]  Toby Sharp,et al.  The Phong Surface: Efficient 3D Model Fitting using Lifted Optimization , 2020, ECCV.

[6]  Christian Theobalt,et al.  HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation From a Single Depth Map , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Miaomiao Liu,et al.  Single Image Optical Flow Estimation With an Event Camera , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  C. Theobalt,et al.  Monocular Real-Time Hand Shape and Motion Capture Using Multi-Modal Data , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  J. Kautz,et al.  Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints , 2020, ECCV.

[10]  Davide Scaramuzza,et al.  Dynamic obstacle avoidance for quadrotors with event cameras , 2020, Science Robotics.

[11]  Rika Sugimoto Dimitrova,et al.  Towards Low-Latency High-Bandwidth Control of Quadrotors using Event Cameras , 2019, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[12]  Angela Yao,et al.  Aligning Latent Spaces for 3D Hand Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Christian Theobalt,et al.  EventCap: Monocular 3D Capture of High-Speed Human Motions Using an Event Camera , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Luc Van Gool,et al.  Dual Grid Net: hand mesh vertex regression from single depth maps , 2019, ECCV.

[15]  V. Lepetit,et al.  HOnnotate: A Method for 3D Annotation of Hand and Object Poses , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Vladlen Koltun,et al.  High Speed and High Dynamic Range Video with an Event Camera , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Chiara Bartolozzi,et al.  Event-Based Vision: A Survey , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Davide Scaramuzza,et al.  End-to-End Learning of Representations for Asynchronous Event-Based Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Tae-Kyun Kim,et al.  Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Qiang Li,et al.  End-to-End Hand Mesh Recovery From a Monocular RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Gregory Cohen,et al.  Observational evaluation of event cameras performance in optical space surveillance , 2019 .

[23]  Junsong Yuan,et al.  Space-Time Event Clouds for Gesture Recognition: From RGB Cameras to Event Cameras , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[24]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Anders P. Eriksson,et al.  Star Tracking Using an Event Camera , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[26]  Dongheui Lee,et al.  Point-To-Pose Voting Based Hand Pose Estimation Using Residual Permutation Equivariant Layer , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  D. Scaramuzza,et al.  EMVS: Event-Based Multi-View Stereo—3D Reconstruction with an Event Camera in Real-Time , 2018, International Journal of Computer Vision.

[28]  Xin Yu,et al.  Bringing a Blurry Frame Alive at High Frame-Rate With an Event Camera , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Nick Barnes,et al.  Continuous-time Intensity Estimation Using Event Cameras , 2018, ACCV.

[30]  Davide Scaramuzza,et al.  ESIM: an Open Event Camera Simulator , 2018, CoRL.

[31]  Jianfei Cai,et al.  Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images , 2018, ECCV.

[32]  Margarita Chli,et al.  ACE: An Efficient Asynchronous Corner Tracker for Event Cameras , 2018, 2018 International Conference on 3D Vision (3DV).

[33]  Didier Stricker,et al.  DeepHPS: End-to-end Estimation of 3D Hand Pose and Shape by Learning from Synthetic Depth , 2018, 2018 International Conference on 3D Vision (3DV).

[34]  Jakub W. Pachocki,et al.  Learning dexterous in-hand manipulation , 2018, Int. J. Robotics Res..

[35]  Bo Chen,et al.  MnasNet: Platform-Aware Neural Architecture Search for Mobile , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Xiangyu Zhang,et al.  ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design , 2018, ECCV.

[37]  Yi Zhou,et al.  Semi-Dense 3D Reconstruction with a Stereo Event Camera , 2018, ECCV.

[38]  Margarita Chli,et al.  Asynchronous Corner Detection and Tracking for Event Cameras in Real Time , 2018, IEEE Robotics and Automation Letters.

[39]  Junsong Yuan,et al.  Hand PointNet: 3D Hand Pose Estimation Using Point Sets , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Narciso García,et al.  Event-Based Vision Meets Deep Learning on Steering Prediction for Self-Driving Cars , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Otmar Hilliges,et al.  Cross-Modal Deep Variational Hand Pose Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Ryad Benosman,et al.  HATS: Histograms of Averaged Time Surfaces for Robust Event-Based Object Classification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Yiannis Aloimonos,et al.  Event-Based Moving Object Detection and Tracking , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[44]  Kostas Daniilidis,et al.  EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras , 2018, Robotics: Science and Systems.

[45]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Christian Theobalt,et al.  GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Kyoung Mu Lee,et al.  V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  David Kim,et al.  Articulated distance fields for ultra-fast tracking of hands interacting , 2017, ACM Trans. Graph..

[49]  T. Rösgen,et al.  Three-dimensional particle tracking velocimetry using dynamic vision sensors , 2017 .

[50]  Davide Scaramuzza,et al.  Real-time Visual-Inertial Odometry for Event Cameras using Keyframe-based Nonlinear Optimization , 2017, BMVC.

[51]  Tobi Delbrück,et al.  A Low Power, Fully Event-Based Gesture Recognition System , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Garrick Orchard,et al.  HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Luc Van Gool,et al.  Crossing Nets: Combining GANs and VAEs with a Shared Latent Space for Hand Pose Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Stefan Leutenegger,et al.  Simultaneous Optical Flow and Intensity Estimation from an Event Camera , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Rüdiger Dillmann,et al.  Towards a framework for end-to-end control of a simulated vehicle with spiking neural networks , 2016, 2016 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR).

[58]  Stefan Leutenegger,et al.  Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera , 2016, ECCV.

[59]  Chiara Bartolozzi,et al.  Fast event-based Harris corner detection exploiting the advantages of event-driven cameras , 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[60]  T. Pock,et al.  Real-Time Intensity-Image Reconstruction for Event Cameras Using Manifold Regularisation , 2016, International Journal of Computer Vision.

[61]  Wenzhen Yuan,et al.  Fast localization and tracking using event sensors , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[62]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Vincent Lepetit,et al.  Training a Feedback Loop for Hand Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[64]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[66]  Horst Bischof,et al.  Event-driven stereo matching for real-time 3D panoramic vision , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[68]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[69]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[70]  Daniel Cremers,et al.  Event-based 3D SLAM with a depth-augmented dynamic vision sensor , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[71]  Chiara Bartolozzi,et al.  Event-Based Visual Flow , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[72]  Jörg Conradt,et al.  Simultaneous Localization and Mapping for Event-Based Vision Systems , 2013, ICVS.

[73]  Andrew Howard,et al.  Design and use paradigms for Gazebo, an open-source multi-robot simulator , 2004, 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566).

[74]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[75]  Bin Zhou,et al.  Learning to See in the Dark with Events , 2020, ECCV.

[76]  Christian Theobalt,et al.  HTML: A Parametric Hand Texture Model for 3D Hand Reconstruction and Personalization , 2020, ECCV.

[77]  T. Delbruck,et al.  A 128 128 120 dB 15 s Latency Asynchronous Temporal Contrast Vision Sensor , 2006 .

[78]  T. Başar,et al.  A New Approach to Linear Filtering and Prediction Problems , 2001 .