Text2Action: Generative Adversarial Synthesis from Language to Action

In this paper, we propose a generative model which learns the relationship between language and human action in order to generate a human action sequence given a sentence describing human behavior. The proposed generative model is a generative adversarial network (GAN), which is based on the sequence to sequence (SEQ2SEQ) model. Using the proposed generative network, we can synthesize various actions for a robot or a virtual agent using a text encoder recurrent neural network (RNN) and an action decoder RNN. The proposed generative network is trained from 29,770 pairs of actions and sentence annotations extracted from MSR-Video-to-Text (MSR-VTT), a large-scale video dataset. We demonstrate that the network can generate human-like actions which can be transferred to a Baxter robot, such that the robot performs an action based on a provided sentence. Results show that the proposed generative network correctly models the relationship between language and action and can generate a diverse set of actions from the same sentence.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[3]  Tamim Asfour,et al.  Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks , 2017, Robotics Auton. Syst..

[4]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[5]  S. Eddy Hidden Markov models. , 1996, Current opinion in structural biology.

[6]  Yoshihiko Nakamura,et al.  Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions , 2015, Int. J. Robotics Res..

[7]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Tamim Asfour,et al.  The KIT Motion-Language Dataset , 2016, Big Data.

[9]  Yoshihiko Nakamura,et al.  Symbolically structured database for human whole body motions based on association between motion symbols and motion words , 2015, Robotics Auton. Syst..

[10]  Geoffrey E. Hinton,et al.  Grammar as a Foreign Language , 2014, NIPS.

[11]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[12]  Thomas Brox,et al.  Generating Images with Perceptual Similarity Metrics based on Deep Networks , 2016, NIPS.

[13]  Christian Ledig,et al.  Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[15]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[18]  E. Ribes-Iñesta,et al.  Human Behavior as Language: Some Thoughts on Wittgenstein , 2006 .