Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision

We tackle the problem of Human Locomotion Forecasting, the task of jointly predicting the spatial positions of several keypoints on the human body in the near future in an egocentric setting. In contrast to previous work, which addresses either pose prediction or trajectory forecasting in isolation, we propose a framework that unifies these two problems and addresses the practically useful task of pedestrian locomotion prediction in the wild. A major challenge in solving this task is the scarcity of egocentric video datasets with dense annotations for pose, depth, or egomotion. To surmount this difficulty, we use state-of-the-art models to generate (noisy) annotations and propose robust forecasting models that can learn from this noisy supervision. We present a method that disentangles the overall pedestrian motion into easier-to-learn subparts by utilizing a pose completion module and a decomposition module. The completion module fills in the missing keypoint annotations, and the decomposition module breaks the cleaned locomotion down into global (trajectory) and local (pose keypoint movement) components. Further, with Quasi-RNN as our backbone, we propose a novel hierarchical trajectory forecasting network that utilizes domain-specific low-level vision signals such as egomotion and depth to predict the global trajectory. Our method leads to state-of-the-art results for the prediction of human locomotion in the egocentric view.
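The completion and decomposition steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the completion module in the paper is learned, whereas here a nearest-observed-frame copy stands in for it, and the choice of keypoint 0 as the root that defines the global trajectory is an assumption. The names `complete_pose`, `decompose`, and `recompose` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Pose = List[Optional[Point]]  # None marks a keypoint the pose detector missed


def complete_pose(sequence: List[Pose]) -> List[List[Point]]:
    """Fill each missing keypoint from the nearest frame where it was observed.

    Assumption: a simple stand-in for a learned completion module.
    """
    num_frames = len(sequence)
    num_keypoints = len(sequence[0])
    completed = [list(pose) for pose in sequence]
    for k in range(num_keypoints):
        observed = [t for t in range(num_frames) if sequence[t][k] is not None]
        if not observed:
            raise ValueError(f"keypoint {k} is never observed")
        for t in range(num_frames):
            if sequence[t][k] is None:
                # Ties are resolved in favor of the earlier frame.
                nearest = min(observed, key=lambda s: abs(s - t))
                completed[t][k] = sequence[nearest][k]
    return completed


def decompose(sequence: List[List[Point]], root: int = 0):
    """Split locomotion into a global trajectory and root-relative local pose.

    Assumption: the global stream is the position of one root keypoint.
    """
    trajectory = [pose[root] for pose in sequence]
    local = [
        [(x - gx, y - gy) for (x, y) in pose]
        for pose, (gx, gy) in zip(sequence, trajectory)
    ]
    return trajectory, local


def recompose(trajectory: List[Point], local: List[List[Point]]):
    """Invert decompose: add the global position back to every local offset."""
    return [
        [(x + gx, y + gy) for (x, y) in pose]
        for pose, (gx, gy) in zip(local, trajectory)
    ]


# Three frames, two keypoints; keypoint 1 is missing in the middle frame.
frames = [
    [(0.0, 0.0), (1.0, 2.0)],
    [(1.0, 0.0), None],
    [(2.0, 0.0), (3.0, 2.0)],
]
cleaned = complete_pose(frames)
trajectory, local = decompose(cleaned)
```

Under this split, separate forecasters could be trained on `trajectory` and `local`, and their predictions recombined with `recompose` to obtain full keypoint positions.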
