Grounding Open-Domain Instructions to Automate Web Support Tasks

Grounding natural language instructions on the web to perform previously unseen tasks enables accessibility and automation. We introduce a task and dataset for training AI agents from open-domain, step-by-step instructions originally written for people. We build RUSS (Rapid Universal Support Service) to tackle this problem. RUSS consists of two models: first, a BERT-LSTM with pointers parses instructions into WebLang, a domain-specific language we design for grounding natural language on the web; then, a grounding model retrieves the unique IDs of any webpage elements requested in the WebLang program. RUSS can either interact with the user through dialogue (e.g., asking for an address) or execute a web operation (e.g., clicking a button) inside the web runtime. To augment training, we synthesize natural language instructions mapped to WebLang. Our dataset consists of 80 different customer service problems from help websites, with a total of 741 step-by-step instructions and their corresponding actions. RUSS achieves 76.7% end-to-end accuracy in predicting agent actions from single instructions, outperforming state-of-the-art models that map instructions directly to actions without WebLang. Our user study shows that users prefer RUSS over web navigation.
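
To make the two-stage architecture concrete, the sketch below mocks the pipeline in Python. The WebLang surface form, the `WebLangOp` structure, and the `parse_to_weblang` / `ground_element` interfaces are illustrative assumptions for exposition only; they are not the paper's actual DSL syntax or API, and the toy heuristics stand in for the trained BERT-LSTM parser and learned grounding model.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All names and interfaces here are assumptions, not the paper's actual API.

from dataclasses import dataclass

@dataclass
class WebLangOp:
    action: str          # e.g. "click", "enter", "say", "ask"
    description: str     # natural-language description of the target element
    argument: str = ""   # value to type, or prompt to show the user

def parse_to_weblang(instruction: str) -> list[WebLangOp]:
    """Stage 1 (assumed interface): a BERT-LSTM parser with pointers maps
    one step-by-step instruction to a sequence of WebLang operations."""
    # Placeholder heuristic: a real system would run the trained parser here.
    if "click" in instruction.lower():
        return [WebLangOp(action="click", description=instruction)]
    return [WebLangOp(action="say", description="", argument=instruction)]

def ground_element(description: str, dom_elements: dict[str, str]) -> str:
    """Stage 2 (assumed interface): the grounding model retrieves the unique
    ID of the webpage element matching a WebLang element description."""
    # Placeholder: naive token overlap instead of the learned grounding model.
    def overlap(text: str) -> int:
        return len(set(text.lower().split()) & set(description.lower().split()))
    return max(dom_elements, key=lambda eid: overlap(dom_elements[eid]))

def execute(instruction: str, dom_elements: dict[str, str]) -> None:
    """Run one instruction: web operations are grounded and dispatched to the
    web runtime; dialogue operations are surfaced to the user."""
    for op in parse_to_weblang(instruction):
        if op.action in {"click", "enter"}:          # web runtime operation
            element_id = ground_element(op.description, dom_elements)
            print(f"{op.action} -> element #{element_id}")
        else:                                        # dialogue with the user
            print(f"agent says: {op.argument}")

# Toy usage: ground "Click the Submit button" against a tiny mock DOM.
execute("Click the Submit button",
        {"btn-42": "Submit button", "lnk-7": "Privacy policy link"})
```

One plausible benefit of this split, consistent with the abstract, is that parsing and grounding can be trained separately: the parser can learn from synthesized instruction-to-WebLang pairs without reference to any particular webpage's DOM, while the grounding model handles page-specific element retrieval.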
