Modeling Human Motion with Quaternion-Based Neural Networks

Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angles or exponential maps as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This work addresses both limitations. QuaterNet represents rotations with quaternions and our loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors. We investigate both recurrent and convolutional architectures and evaluate on short-term prediction and long-term generation. For the latter, our approach is qualitatively judged as realistic as recent neural strategies from the graphics literature. Our experiments compare quaternions to Euler angles as well as exponential maps and show that only a very short context is required to make reliable future predictions. Finally, we show that the standard evaluation protocol for Human3.6M produces high variance results and we propose a simple solution.

[1]  Jon A. Webb,et al.  Quaternions in Computer Vision and Robotics , 1982 .

[2]  Ken Shoemake,et al.  Animating rotation with quaternion curves , 1985, SIGGRAPH.

[3]  J. Michael McCarthy,et al.  Introduction to theoretical kinematics , 1990 .

[4]  Norman I. Badler,et al.  Simulating humans: computer graphics animation and control , 1993 .

[5]  F. Sebastian Grassia,et al.  Practical Parameterization of Rotations Using the Exponential Map , 1998, J. Graphics, GPU, & Game Tools.

[6]  Franck Multon,et al.  Computer animation of human walking: a survey , 1999, Comput. Animat. Virtual Worlds.

[7]  Alberto Menache,et al.  Understanding Motion Capture for Computer Animation and Video Games , 1999 .

[8]  Franck Multon,et al.  Computer Animation of Human Walking: a Survey , 1999 .

[9]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[10]  Adrian Hilton,et al.  Realistic synthesis of novel human movements from a database of motion capture examples , 2000, Proceedings Workshop on Human Motion.

[11]  Vladimir Pavlovic,et al.  Learning Switching Linear Models of Human Motion , 2000, NIPS.

[12]  Tim Cootes,et al.  An Introduction to Active Shape Models , 2000 .

[13]  Yoshua Bengio,et al.  Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[14]  J. Stoer,et al.  Introduction to Numerical Analysis , 2002 .

[15]  Ian Parberry,et al.  3D Math Primer for Graphics and Game Development, 2nd Edition , 2011 .

[16]  David A. Forsyth,et al.  Motion synthesis from annotations , 2003, ACM Trans. Graph..

[17]  R. Chellappa,et al.  View independent human body pose estimation from a single perspective image , 2004, CVPR 2004.

[18]  R. Chellappa,et al.  View independent human body pose estimation from a single perspective image , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[19]  C. K. Liu,et al.  Learning physics-based motion style with nonlinear inverse optimization , 2005, SIGGRAPH 2005.

[20]  C. Karen Liu,et al.  Learning physics-based motion style with nonlinear inverse optimization , 2005, ACM Trans. Graph..

[21]  Pascal Fua,et al.  Hierarchical implicit surface joint limits for human body tracking , 2005, Comput. Vis. Image Underst..

[22]  David A. Forsyth,et al.  Computational Studies of Human Motion: Part 1, Tracking and Motion Synthesis , 2005, Found. Trends Comput. Graph. Vis..

[23]  Luc Van Gool,et al.  European conference on computer vision (ECCV) , 2006, eccv 2006.

[24]  Geoffrey E. Hinton,et al.  Modeling Human Motion Using Binary Latent Variables , 2006, NIPS.

[25]  Steven M. LaValle,et al.  Planning algorithms , 2006 .

[26]  James F. O'Brien,et al.  Computational Studies of Human Motion , 2006 .

[27]  Adrien Treuille,et al.  Near-optimal character animation with continuous control , 2007, SIGGRAPH 2007.

[28]  Tido Röder,et al.  Documentation Mocap Database HDM05 , 2007 .

[29]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[30]  Du Q. Huynh,et al.  Metrics for 3D Rotations: Comparison and Analysis , 2009, Journal of Mathematical Imaging and Vision.

[31]  Dima Damen,et al.  Computer Vision and Pattern Recognition (CVPR) , 2009 .

[32]  Gino van den Bergen,et al.  Game Physics Pearls , 2010 .

[33]  Cristian Sminchisescu,et al.  Latent structured models for human pose estimation , 2011, 2011 International Conference on Computer Vision.

[34]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[35]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[36]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[37]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[38]  Roland Göcke,et al.  Monocular Image 3D Human Pose Estimation under Self-Occlusion , 2013, 2013 IEEE International Conference on Computer Vision.

[39]  Jessica K. Hodgins,et al.  Hierarchical Aligned Cluster Analysis for Temporal Clustering of Human Motion , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[41]  Ruzena Bajcsy,et al.  Berkeley MHAD: A comprehensive Multimodal Human Action Database , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[42]  Sebastian Nowozin,et al.  Efficient Nonlinear Markov Models for Human Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[44]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[45]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Andreas Krause,et al.  Advances in Neural Information Processing Systems (NIPS) , 2014 .

[47]  Silvio Savarese,et al.  A Hierarchical Representation for Future Action Prediction , 2014, ECCV.

[48]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[50]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[51]  Jessica K. Hodgins,et al.  Realtime style transfer for unlabeled heterogeneous human motion , 2015, ACM Trans. Graph..

[52]  Taku Komura,et al.  Learning motion manifolds with convolutional autoencoders , 2015, SIGGRAPH Asia Technical Briefs.

[53]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[54]  Vincent Lepetit,et al.  Hands Deep in Deep Learning for Hand Pose Estimation , 2015, ArXiv.

[55]  J. Dai Euler–Rodrigues formula variations, quaternion conjugation and intrinsic connections , 2015 .

[56]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[57]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[59]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[60]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Yichen Wei,et al.  Model-Based Deep Hand Pose Estimation , 2016, IJCAI.

[62]  Marc'Aurelio Ranzato,et al.  Sequence Level Training with Recurrent Neural Networks , 2015, ICLR.

[63]  Koray Kavukcuoglu,et al.  Pixel Recurrent Neural Networks , 2016, ICML.

[64]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Hema Swetha Koppula,et al.  Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66]  Alexander M. Rush,et al.  Sequence-to-Sequence Learning as Beam-Search Optimization , 2016, EMNLP.

[67]  Jordi Gonzàlez,et al.  Human Pose Estimation from Monocular Images: A Comprehensive Survey , 2016, Sensors.

[68]  Martial Hebert,et al.  An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders , 2016, ECCV.

[69]  Gabriel Synnaeve,et al.  Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.

[70]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[71]  Wei Zhang,et al.  Deep Kinematic Pose Regression , 2016, ECCV Workshops.

[72]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[73]  Taku Komura,et al.  A deep learning framework for character motion synthesis and editing , 2016, ACM Trans. Graph..

[74]  Anoop Cherian,et al.  Human Pose Forecasting via Deep Markov Models , 2017, 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[75]  Dieter Fox,et al.  SE3-nets: Learning rigid body motion using deep neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[76]  Sushil Kumar,et al.  Machine Learning with Resilient Propagation in Quaternionic Domain , 2017 .

[77]  Scott Cohen,et al.  Forecasting Human Dynamics from Static Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[78]  Martial Hebert,et al.  The Pose Knows: Video Forecasting by Generating Pose Futures , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[79]  Yann Dauphin,et al.  Convolutional Sequence to Sequence Learning , 2017, ICML.

[80]  Yann LeCun,et al.  Predicting Deeper into the Future of Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[81]  Taku Komura,et al.  Phase-functioned neural networks for character control , 2017, ACM Trans. Graph..

[82]  Fei Han,et al.  Space-Time Representation of People Based on 3D Skeletal Data: A Review , 2016, Comput. Vis. Image Underst..

[83]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[84]  Yi Zhou,et al.  Auto-Conditioned LSTM Network for Extended Complex Human Motion Synthesis , 2017, ArXiv.

[85]  Yann Dauphin,et al.  Language Modeling with Gated Convolutional Networks , 2016, ICML.

[86]  Otmar Hilliges,et al.  Learning Human Motion Models for Long-Term Predictions , 2017, 2017 International Conference on 3D Vision (3DV).

[87]  Danica Kragic,et al.  Deep Representation Learning for Human Motion Prediction and Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[88]  Ruben Villegas,et al.  Learning to Generate Long-term Future via Hierarchical Prediction , 2017, ICML.

[89]  Ying Zhang,et al.  Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition , 2018, INTERSPEECH.

[90]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[91]  Zhen Zhang,et al.  Convolutional Sequence to Sequence Model for Human Dynamics , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[92]  Minho Lee,et al.  Human Action Generation with Generative Adversarial Networks , 2018, ArXiv.

[93]  Dario Pavllo,et al.  QuaterNet: A Quaternion-based Recurrent Model for Human Motion , 2018, BMVC.

[94]  Yi Xu,et al.  Quaternion Convolutional Neural Networks , 2018, ECCV.

[95]  Danica Kragic,et al.  Anticipating Many Futures: Online Human Motion Prediction and Generation for Human-Robot Interaction , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[96]  Ruben Villegas,et al.  Neural Kinematic Networks for Unsupervised Motion Retargetting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[97]  Ira Kemelmacher-Shlizerman,et al.  Audio to Body Dynamics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[98]  José M. F. Moura,et al.  Adversarial Geometry-Aware Human Motion Prediction , 2018, ECCV.

[99]  Yann LeCun,et al.  Predicting Future Instance Segmentations by Forecasting Convolutional Features , 2018, ECCV.

[100]  Anthony S. Maida,et al.  Deep Quaternion Networks , 2017, 2018 International Joint Conference on Neural Networks (IJCNN).

[101]  Xiao Lin,et al.  Human Motion Modeling using DVGANs , 2018, ArXiv.

[102]  C. Lee Giles,et al.  A Neural Temporal Model for Human Motion Prediction , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[103]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[104]  Titouan Parcollet,et al.  Quaternion Recurrent Neural Networks , 2018, ICLR.

[105]  Zhiyong Wang,et al.  Combining Recurrent Neural Networks and Adversarial Training for Human Motion Synthesis and Control , 2018, IEEE Transactions on Visualization and Computer Graphics.