Improved Models and Queries for Grounded Human-Robot Dialog

The ability to understand and communicate in natural language can make robots much more accessible to naive users. Environments such as homes and offices contain many objects that humans describe in diverse language referencing perceptual properties, and robots operating in these environments need to understand such descriptions. Different types of dialog interaction with humans can help robots clarify their understanding to reduce mistakes, improve their language understanding models, and adapt those models to the specific domain of operation. We present completed work on jointly learning a dialog policy that enables a robot to clarify partially understood natural language commands, while simultaneously using the dialogs to improve the underlying semantic parser for future commands. We introduce the setting of opportunistic active learning, a framework for interactive tasks that use supervised models. This framework allows a robot to ask diverse, potentially off-topic queries across interactions, requiring the robot to trade off task completion against knowledge acquisition for future tasks. We also attempt to learn a dialog policy in this framework using reinforcement learning. We propose a novel distributional model for perceptual grounding, based on learning a joint space for vector representations from multiple modalities. We also propose a method for identifying more informative clarification questions that can scale to a larger space of objects, and aim to learn a dialog policy that makes use of such clarifications.
