Grounding Open-Domain Instructions to Automate Web Support Tasks

Grounding natural language instructions on the web to perform previously unseen tasks enables accessibility and automation. We introduce a task and dataset for training AI agents from open-domain, step-by-step instructions originally written for people. We build RUSS (Rapid Universal Support Service) to tackle this problem. RUSS consists of two models: first, a BERT-LSTM with pointers parses instructions into WebLang, a domain-specific language we design for grounding natural language on the web; then, a grounding model retrieves the unique IDs of any webpage elements requested in the WebLang program. RUSS can either interact with the user through dialogue (e.g., asking for an address) or execute a web operation (e.g., clicking a button) inside the web runtime. To augment training, we synthesize natural language instructions mapped to WebLang. Our dataset consists of 80 different customer service problems from help websites, with a total of 741 step-by-step instructions and their corresponding actions. RUSS achieves 76.7% end-to-end accuracy in predicting agent actions from single instructions, outperforming state-of-the-art models that map instructions directly to actions without WebLang. Our user study shows that users prefer RUSS over web navigation.
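
To make the two-stage architecture concrete, the sketch below mocks the pipeline in Python. The WebLang surface form, the `WebLangOp` structure, and the `parse_to_weblang` / `ground_element` interfaces are illustrative assumptions for exposition only; they are not the paper's actual DSL syntax or API, and the toy heuristics stand in for the trained BERT-LSTM parser and learned grounding model.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All names and interfaces here are assumptions, not the paper's actual API.

from dataclasses import dataclass

@dataclass
class WebLangOp:
    action: str          # e.g. "click", "enter", "say", "ask"
    description: str     # natural-language description of the target element
    argument: str = ""   # value to type, or prompt to show the user

def parse_to_weblang(instruction: str) -> list[WebLangOp]:
    """Stage 1 (assumed interface): a BERT-LSTM parser with pointers maps
    one step-by-step instruction to a sequence of WebLang operations."""
    # Placeholder heuristic: a real system would run the trained parser here.
    if "click" in instruction.lower():
        return [WebLangOp(action="click", description=instruction)]
    return [WebLangOp(action="say", description="", argument=instruction)]

def ground_element(description: str, dom_elements: dict[str, str]) -> str:
    """Stage 2 (assumed interface): the grounding model retrieves the unique
    ID of the webpage element matching a WebLang element description."""
    # Placeholder: naive token overlap instead of the learned grounding model.
    def overlap(text: str) -> int:
        return len(set(text.lower().split()) & set(description.lower().split()))
    return max(dom_elements, key=lambda eid: overlap(dom_elements[eid]))

def execute(instruction: str, dom_elements: dict[str, str]) -> None:
    """Run one instruction: web operations are grounded and dispatched to the
    web runtime; dialogue operations are surfaced to the user."""
    for op in parse_to_weblang(instruction):
        if op.action in {"click", "enter"}:          # web runtime operation
            element_id = ground_element(op.description, dom_elements)
            print(f"{op.action} -> element #{element_id}")
        else:                                        # dialogue with the user
            print(f"agent says: {op.argument}")

# Toy usage: ground "Click the Submit button" against a tiny mock DOM.
execute("Click the Submit button",
        {"btn-42": "Submit button", "lnk-7": "Privacy policy link"})
```

One plausible benefit of this split, consistent with the abstract, is that parsing and grounding can be trained separately: the parser can learn from synthesized instruction-to-WebLang pairs without reference to any particular webpage's DOM, while the grounding model handles page-specific element retrieval.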
