Paper information - Text2Action: Generative Adversarial Synthesis from Language to Action

In this paper, we propose a generative model that learns the relationship between language and human action in order to generate a human action sequence from a sentence describing human behavior. The proposed model is a generative adversarial network (GAN) based on the sequence-to-sequence (SEQ2SEQ) model. Using a text encoder recurrent neural network (RNN) and an action decoder RNN, the proposed network can synthesize various actions for a robot or a virtual agent. The network is trained on 29,770 pairs of actions and sentence annotations extracted from MSR-Video-to-Text (MSR-VTT), a large-scale video dataset. We demonstrate that the network can generate human-like actions that can be transferred to a Baxter robot, so that the robot performs an action based on a provided sentence. Results show that the proposed network correctly models the relationship between language and action and can generate a diverse set of actions from the same sentence.
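The encoder-decoder structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain tanh RNN cells, random untrained weights, a whitespace tokenizer, and a noise vector appended to the decoder input at every step, and it omits the discriminator and adversarial training entirely. The class name `Text2ActionSketch`, all dimensions, and the way the noise is injected are hypothetical choices made for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (an untrained placeholder, not learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_h, x, h):
    """One vanilla RNN step: h' = tanh(W_in x + W_h h)."""
    return [math.tanh(p + q) for p, q in zip(matvec(w_in, x), matvec(w_h, h))]


class Text2ActionSketch:
    """Hypothetical generator: text encoder RNN followed by an action decoder RNN."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=16, noise_dim=4,
                 pose_dim=6, seed=0):
        rng = random.Random(seed)
        self.hidden_dim, self.noise_dim, self.pose_dim = hidden_dim, noise_dim, pose_dim
        # Assumption: random word embeddings stand in for pretrained word vectors.
        self.embed = {w: [rng.uniform(-1, 1) for _ in range(embed_dim)] for w in vocab}
        self.unknown = [0.0] * embed_dim
        self.enc_in = make_matrix(hidden_dim, embed_dim, rng)
        self.enc_h = make_matrix(hidden_dim, hidden_dim, rng)
        self.dec_in = make_matrix(hidden_dim, pose_dim + noise_dim, rng)
        self.dec_h = make_matrix(hidden_dim, hidden_dim, rng)
        self.out = make_matrix(pose_dim, hidden_dim, rng)

    def encode(self, sentence):
        """Run the text encoder over the words and return its final hidden state."""
        h = [0.0] * self.hidden_dim
        for word in sentence.lower().split():
            h = rnn_step(self.enc_in, self.enc_h, self.embed.get(word, self.unknown), h)
        return h

    def generate(self, sentence, num_frames=5, noise_seed=None):
        """Decode a sequence of pose vectors from the sentence and a noise sample."""
        noise_rng = random.Random(noise_seed)
        z = [noise_rng.gauss(0.0, 1.0) for _ in range(self.noise_dim)]
        h = self.encode(sentence)          # decoder starts from the text features
        pose = [0.0] * self.pose_dim       # assumed neutral starting pose
        frames = []
        for _ in range(num_frames):
            h = rnn_step(self.dec_in, self.dec_h, pose + z, h)
            pose = matvec(self.out, h)     # linear readout to a pose vector
            frames.append(pose)
        return frames


model = Text2ActionSketch(vocab=["a", "man", "waves", "his", "hand"])
a = model.generate("a man waves his hand", num_frames=5, noise_seed=1)
b = model.generate("a man waves his hand", num_frames=5, noise_seed=2)
print(len(a), len(a[0]))  # 5 frames, each a 6-dimensional pose vector
```

Calling `generate` twice with the same sentence but different noise seeds yields different pose sequences, which mirrors the abstract's claim that a diverse set of actions can be produced from one sentence. In a GAN setup, a discriminator would additionally score (sentence, action) pairs during training; that part is left out of this sketch.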

[1] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[2] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[3] Tamim Asfour, et al. Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks, 2017, Robotics Auton. Syst.

[4] Bernt Schiele, et al. Generative Adversarial Text to Image Synthesis, 2016, ICML.

[5] S. Eddy. Hidden Markov models, 1996, Current Opinion in Structural Biology.

[6] Yoshihiko Nakamura, et al. Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions, 2015, Int. J. Robotics Res.

[7] Xiaowei Zhou, et al. Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Tamim Asfour, et al. The KIT Motion-Language Dataset, 2016, Big Data.

[9] Yoshihiko Nakamura, et al. Symbolically structured database for human whole body motions based on association between motion symbols and motion words, 2015, Robotics Auton. Syst.

[10] Geoffrey E. Hinton, et al. Grammar as a Foreign Language, 2014, NIPS.

[11] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[12] Thomas Brox, et al. Generating Images with Perceptual Similarity Metrics based on Deep Networks, 2016, NIPS.

[13] Christian Ledig, et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[15] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Varun Ramakrishna, et al. Convolutional Pose Machines, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[18] E. Ribes-Iñesta, et al. Human Behavior as Language: Some Thoughts on Wittgenstein, 2006.