Multimodal Interactive Learning of Primitive Actions

We describe an ongoing project on learning to perform primitive actions from demonstrations using an interactive interface. In our previous work, we used demonstrations captured from humans performing actions as training samples for a neural network-based trajectory model of actions, which a computational agent then performs in novel setups. That original framework had limitations that we aim to overcome by incorporating communication between the human and the computational agent, using their interaction to fine-tune the model learned by the machine. We propose a framework that uses multimodal human-computer interaction to teach action concepts to machines, drawing on live demonstration and natural language communication as two distinct teaching modalities, while requiring few training samples.
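To make the training-then-fine-tuning loop concrete, the following is a minimal sketch, not the authors' implementation: a small neural trajectory model is fit to a handful of demonstration samples, then refined with corrective samples elicited during human-agent interaction. All names (TrajectoryModel, train, the state and offset dimensions) are hypothetical assumptions for illustration.

```python
# Hedged sketch: few-shot trajectory learning followed by interactive fine-tuning.
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """Maps the current agent/object state to the next trajectory offset."""
    def __init__(self, state_dim: int = 6, offset_dim: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, offset_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def train(model, states, offsets, epochs=200, lr=1e-3):
    """Supervised regression on (state, next-offset) pairs from demonstrations."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(states), offsets)
        loss.backward()
        opt.step()
    return model

# Few-shot training from captured demonstrations (toy random data stands in here).
demo_states, demo_offsets = torch.randn(20, 6), torch.randn(20, 3)
model = train(TrajectoryModel(), demo_states, demo_offsets)

# Interactive fine-tuning: corrective samples gathered through language or a
# follow-up live demonstration, trained with fewer epochs and a smaller step.
corr_states, corr_offsets = torch.randn(5, 6), torch.randn(5, 3)
model = train(model, corr_states, corr_offsets, epochs=50, lr=1e-4)
```

The design choice illustrated is only the two-stage loop: an initial model from few demonstrations, then low-learning-rate updates driven by multimodal feedback, rather than any specific architecture used in the project.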
