Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning

Mobile user interface summarization generates succinct language descriptions of mobile screens to convey their important content and functionality, which is useful for many language-based application scenarios. We present Screen2Words, a novel screen summarization approach that automatically encapsulates the essential information of a UI screen into a coherent language phrase. Summarizing mobile screens requires a holistic understanding of the multi-modal data of mobile UIs, including text, images, structure, and UI semantics, which motivates our multi-modal learning approach. We collected and analyzed a large-scale screen summarization dataset annotated by human workers, containing more than 112k language summaries across ~22k unique UI screens. We then experimented with a set of deep models with different configurations. Our evaluation of these models, using both automatic accuracy metrics and human ratings, shows that our approach can generate high-quality summaries for mobile screens. We demonstrate potential use cases of Screen2Words and open-source our dataset and model to lay the foundation for further bridging language and user interfaces.
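To make the multi-modal setup concrete, below is a minimal sketch, not the paper's exact architecture, of a screen summarizer in PyTorch that fuses UI-element text embeddings with a screenshot feature and decodes a summary autoregressively with a Transformer. All module names, dimensions, and the choice of a ResNet image backbone are illustrative assumptions rather than the published model configuration.

import torch
import torch.nn as nn
import torchvision.models as models


class ScreenSummarizer(nn.Module):
    """Illustrative multi-modal encoder-decoder for screen summarization."""

    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # Text branch: embeddings for tokens taken from UI element text
        # and view-hierarchy attributes.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Image branch: a small ResNet backbone pooled into one screen feature
        # (assumed here; any visual encoder could stand in).
        resnet = models.resnet18(weights=None)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.image_proj = nn.Linear(512, d_model)
        # Fused encoder and autoregressive summary decoder.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, element_tokens, screen_image, summary_tokens):
        # element_tokens: (B, N) token ids from UI elements
        # screen_image:   (B, 3, H, W) screenshot pixels
        # summary_tokens: (B, T) shifted target summary ids (teacher forcing)
        text_feats = self.token_emb(element_tokens)                 # (B, N, d)
        img_feat = self.image_encoder(screen_image).flatten(1)      # (B, 512)
        img_feat = self.image_proj(img_feat).unsqueeze(1)           # (B, 1, d)
        # Concatenate the screen-level image feature with per-element text
        # features to form the multi-modal encoder input.
        memory_in = torch.cat([img_feat, text_feats], dim=1)        # (B, 1+N, d)
        tgt = self.token_emb(summary_tokens)                        # (B, T, d)
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        decoded = self.transformer(memory_in, tgt, tgt_mask=tgt_mask)
        return self.out(decoded)                                    # (B, T, vocab)

At training time, such a model would be optimized with cross-entropy against the human-written summaries and scored with standard captioning metrics (e.g., BLEU, CIDEr, ROUGE); the paper's reported models differ in their exact encoders and configurations.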
