Mapping Natural Language Instructions to Mobile UI Action Sequences

We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.
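
To make the two-stage setup concrete, the sketch below shows one way a grounding model along the lines described above could be assembled: UI objects are contextually encoded from their content and screen position with a Transformer encoder and scored against a phrase representation. This is a minimal illustrative sketch, not the paper's implementation; the module names, dimensions, position features, and dot-product scoring are assumptions made here for brevity.

```python
import torch
import torch.nn as nn

class UIObjectEncoder(nn.Module):
    """Contextually encode UI objects from content tokens and screen position
    with a Transformer encoder (illustrative dimensions, not the paper's)."""
    def __init__(self, vocab_size=10000, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.content_emb = nn.Embedding(vocab_size, d_model)
        # Assumed position features: normalized (x, y, width, height) per object.
        self.pos_proj = nn.Linear(4, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, content_ids, positions):
        # content_ids: [batch, num_objects]; positions: [batch, num_objects, 4]
        x = self.content_emb(content_ids) + self.pos_proj(positions)
        return self.encoder(x)  # [batch, num_objects, d_model]

class PhraseGrounder(nn.Module):
    """Score each UI object on the screen against an object-description phrase."""
    def __init__(self, d_model=128):
        super().__init__()
        self.object_encoder = UIObjectEncoder(d_model=d_model)
        self.phrase_proj = nn.Linear(d_model, d_model)

    def forward(self, content_ids, positions, phrase_repr):
        # phrase_repr: [batch, d_model], e.g. pooled from a phrase-extraction model.
        objects = self.object_encoder(content_ids, positions)
        query = self.phrase_proj(phrase_repr).unsqueeze(-1)   # [batch, d_model, 1]
        scores = torch.bmm(objects, query).squeeze(-1)        # [batch, num_objects]
        return scores.softmax(dim=-1)  # distribution over candidate objects

# Toy usage: one screen with 5 objects and random features.
model = PhraseGrounder()
probs = model(torch.randint(0, 10000, (1, 5)), torch.rand(1, 5, 4), torch.rand(1, 128))
print(probs.shape)  # torch.Size([1, 5])
```

In the full task, the phrase representation would come from the Transformer that extracts action phrase tuples from the instruction, and the highest-scoring object, together with the predicted action type, would form one step of the executed sequence.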
