Mapping natural language commands to web elements

The web provides a rich, open-domain environment with textual, structural, and spatial properties. We propose a new task for grounding language in this environment: given a natural language command (e.g., “click on the second article”), choose the correct element on the web page (e.g., a hyperlink or text box). We collected a dataset of over 50,000 commands that capture various phenomena, such as functional references (e.g., “find who made this site”), relational reasoning (e.g., “article by john”), and visual reasoning (e.g., “top-most article”). We also implemented and analyzed three baseline models that capture different phenomena present in the dataset.
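To make the task's input and output concrete, the following is a minimal sketch of the interface: a command and a list of candidate elements go in, and one element comes out. The `Element` class, the `choose_element` function, and the token-overlap scoring are assumptions made for illustration only; they are not the dataset's schema or any of the three baseline models mentioned above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical, simplified view of a web element (not the dataset's schema)."""
    tag: str   # e.g. "a" for a hyperlink, "input" for a text box
    text: str  # visible text or label associated with the element


def tokenize(s: str) -> set[str]:
    """Lowercase the string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def choose_element(command: str, elements: list[Element]) -> Element:
    """Return the element whose text shares the most tokens with the command.

    Ties are broken in favor of the earliest element in the list.
    """
    cmd_tokens = tokenize(command)
    return max(elements, key=lambda e: len(cmd_tokens & tokenize(e.text)))


# A toy page with three candidate elements (illustrative data only).
page = [
    Element("a", "Contact us"),
    Element("a", "Article by John"),
    Element("input", "Search"),
]
chosen = choose_element("open the article by john", page)
print(chosen.tag, chosen.text)
```

A purely lexical scorer like this can only succeed when the command repeats words from the element's text. Commands such as “top-most article” or “find who made this site” share little or no text with the target element, so they would need the spatial and functional cues that the dataset is designed to exercise.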
