Language2Pose: Natural Language Grounded Pose Forecasting

Generating animations from natural language sentences has applications in a number of domains, such as movie script visualization, virtual human animation, and robot motion planning. These sentences can describe different kinds of actions, the speed and direction of these actions, and possibly a target destination. The core modeling challenge in this language-to-pose application is how to map linguistic concepts to motion animations. In this paper, we address this multimodal problem by introducing a neural architecture called Joint Language-to-Pose (or JL2P), which learns a joint embedding of language and pose. This joint embedding space is learned end-to-end using a curriculum learning approach that emphasizes shorter and easier sequences first before moving to longer and harder ones. We evaluate our proposed model on a publicly available corpus of 3D pose data and human-annotated sentences. Both objective metrics and human judgment evaluation confirm that our proposed approach generates more accurate animations, which humans also deem visually more representative than those of other data-driven approaches.
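
The curriculum idea mentioned above (shorter sequences first, longer ones later) can be illustrated with a minimal sketch. The abstract does not give the actual schedule, so the function names `curriculum_lengths` and `truncate_sequence`, and the parameters `start_len` and `step`, are hypothetical choices made only for illustration.

```python
from typing import Iterator, List, Sequence


def curriculum_lengths(max_len: int, start_len: int = 2, step: int = 2) -> Iterator[int]:
    """Yield growing target sequence lengths, ending exactly at max_len.

    Assumption: a simple linear schedule; the paper's schedule may differ.
    """
    length = min(start_len, max_len)
    while length < max_len:
        yield length
        length += step
    yield max_len


def truncate_sequence(poses: Sequence[Sequence[float]], length: int) -> List[Sequence[float]]:
    """Keep only the first `length` frames of a pose sequence."""
    return list(poses[:length])


if __name__ == "__main__":
    # Toy pose sequence: 7 frames, each with 3 joint values.
    sequence = [[float(t)] * 3 for t in range(7)]
    for stage_len in curriculum_lengths(max_len=len(sequence)):
        target = truncate_sequence(sequence, stage_len)
        # A real training loop would fit the model on `target` at this stage.
        print(stage_len, len(target))
```

Each stage would train on pose targets truncated to the current length, so the model sees short, easier prediction horizons before the full sequence.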
