Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale

The layout of a mobile screen is a critical data source for UI design research and semantic understanding of the screen. However, UI layouts in existing datasets are often noisy, have mismatches with their visual representation, or consist of generic or app-specific types that are difficult to analyze and model. In this paper, we propose the CLAY pipeline that uses a deep learning approach for denoising UI layouts, allowing us to automatically improve existing mobile UI layout datasets at scale. Our pipeline takes both the screenshot and the raw UI layout, and annotates the raw layout by removing incorrect nodes and assigning a semantically meaningful type to each node. To experiment with our data-cleaning pipeline, we create the CLAY dataset of 59,555 human-annotated screen layouts, based on screenshots and raw layouts from Rico, a public mobile UI corpus. Our deep models achieve high accuracy, with F1 scores of 82.7% for detecting layout objects that do not have a valid visual representation and 85.9% for recognizing object types, significantly outperforming a heuristic baseline. Our work lays a foundation for creating large-scale, high-quality UI layout datasets for data-driven mobile UI research and reduces the need for manual labeling efforts that are prohibitively expensive.

CCS CONCEPTS • Human-centered computing → Human computer interaction (HCI).
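The two annotation steps named in the abstract (dropping nodes without a valid visual representation, then assigning a semantic type to each remaining node) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `LayoutNode` fields, the `denoise_layout` function, the toy stand-in predicates, and the label strings are hypothetical and are not the CLAY models or the CLAY label set. In the actual pipeline, learned models that see both the screenshot and the raw layout would take the place of the two callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class LayoutNode:
    """Hypothetical flat view of one raw layout node."""
    bounds: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    raw_class: str                     # generic or app-specific class name
    label: Optional[str] = None        # semantic type, filled in by denoising


def denoise_layout(
    screenshot: Any,
    nodes: List[LayoutNode],
    is_invalid: Callable[[Any, LayoutNode], bool],
    predict_type: Callable[[Any, LayoutNode], str],
) -> List[LayoutNode]:
    """Drop nodes judged invalid and label the rest; inputs are not mutated."""
    cleaned = []
    for node in nodes:
        if is_invalid(screenshot, node):
            continue  # step 1: remove nodes with no valid visual representation
        # step 2: assign a semantic type to each surviving node
        cleaned.append(
            LayoutNode(node.bounds, node.raw_class, predict_type(screenshot, node))
        )
    return cleaned


# Toy stand-ins (assumptions): a real system would use learned models here.
def toy_is_invalid(screenshot: Any, node: LayoutNode) -> bool:
    left, top, right, bottom = node.bounds
    return right <= left or bottom <= top  # zero-area box cannot be visible


def toy_predict_type(screenshot: Any, node: LayoutNode) -> str:
    # Placeholder labels, not the CLAY type vocabulary.
    return "BUTTON" if node.raw_class.endswith("Button") else "TEXT"


raw_nodes = [
    LayoutNode((0, 0, 100, 40), "android.widget.Button"),
    LayoutNode((10, 10, 10, 10), "android.view.View"),
    LayoutNode((0, 50, 100, 80), "android.widget.TextView"),
]
cleaned_nodes = denoise_layout(None, raw_nodes, toy_is_invalid, toy_predict_type)
```

Passing the two decisions in as callables keeps the sketch independent of any particular model; the zero-area check and class-name rule above are only there so the example runs end to end.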
