Caption Generation of Robot Behaviors based on Unsupervised Learning of Action Segments

Bridging robot action sequences and their natural language captions is an important task for increasing the explainability of human-assisting robots, a rapidly evolving field. In this paper, we propose a system that generates natural language captions describing the behaviors of human-assisting robots. The system describes robot actions from robot observations, i.e., histories from actuator systems and cameras, toward end-to-end bridging between robot actions and natural language captions. Two issues make it challenging to apply existing sequence-to-sequence models to this mapping: 1) it is hard to prepare a large-scale dataset for every combination of robot and environment, and 2) there is a gap between the number of samples obtained from robot action observations and the number of words in the generated captions. We introduce unsupervised segmentation based on K-means clustering to unify typical robot observation patterns into classes, which enables the network to learn the mapping from a small amount of data. Moreover, we use a chunking method based on byte-pair encoding (BPE) to close the gap between the number of observation samples and the number of words in a caption. We also apply an attention mechanism to the segmentation task. Experimental results show that the proposed model based on unsupervised learning generates better descriptions than other methods. We also show that the attention mechanism did not work well in our low-resource setting.
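To make the segmentation step concrete, the sketch below clusters per-timestep observation features with K-means and collapses runs of identical cluster labels into action segments. This is a minimal sketch, not the paper's implementation: the feature layout, the cluster count, and the helper name segment_observations are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_observations(observations: np.ndarray, n_clusters: int = 8):
    """Assign each timestep a K-means label, then collapse runs of
    identical labels into (label, start, end) action segments.

    observations: (T, D) array of per-timestep features, e.g.
    concatenated joint angles and flattened camera embeddings
    (an assumed feature layout, not the paper's).
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(observations)
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        # Close the current segment at a label change or at the end.
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((int(labels[start]), start, t))
            start = t
    return segments

# Toy usage: 100 timesteps of 16-dimensional synthetic features.
obs = np.random.rand(100, 16)
print(segment_observations(obs, n_clusters=4))
```

Collapsing repeated labels is what lets a long observation history shrink to a handful of segment classes that a small dataset can cover.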
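The BPE-based chunking step can be illustrated in the same spirit: repeatedly merge the most frequent adjacent pair of segment labels into a single token, shortening the sequence toward caption length. This sketch assumes chunking operates on the discrete label sequence produced above; bpe_chunk and num_merges are hypothetical names.

```python
from collections import Counter

def bpe_chunk(labels, num_merges: int = 10):
    """BPE-style merging over a sequence of segment labels
    (a hedged sketch, not the paper's code)."""
    seq = [str(l) for l in labels]
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats; nothing left to chunk
            break
        merged, i = [], 0
        while i < len(seq):
            # Fuse every non-overlapping occurrence of the top pair.
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + "+" + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

# Toy usage: the recurring pattern (1, 2) gets fused first,
# then (1+2, 3), yielding ['1+2', '1+2+3', '1+2+3'].
print(bpe_chunk([1, 2, 1, 2, 3, 1, 2, 3]))
```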
