NL2GDPR: Automatically Develop GDPR Compliant Android Application Features from Natural Language

Recent privacy leakage incidents and stricter policy regulations demand a much higher standard of compliance from companies and mobile apps. However, complying with these regulations, which span various perspectives, activities, and roles, poses significant challenges for app developers, especially small companies and developers who are less experienced in this matter or have limited resources. To address these hurdles, we develop an automatic tool, NL2GDPR, which generates policies from a developer's natural language descriptions while also ensuring that the app's functionalities comply with the General Data Protection Regulation (GDPR). NL2GDPR builds on an information extraction tool, OIA (Open Information Annotation), developed by Baidu Cognitive Computing Lab. At its core, NL2GDPR is a privacy-centric information extraction model, augmented with a GDPR policy finder and a policy generator. We perform a comprehensive study to understand the challenges in extracting privacy-centric information and generating privacy policies, and explore optimizations for this specific task. NL2GDPR achieves 92.9%, 95.2%, and 98.4% accuracy in correctly identifying GDPR policies related to personal data storage, process, and share types, respectively. To the best of our knowledge, NL2GDPR is the first tool that allows a developer to automatically generate GDPR-compliant policies from nothing more than a natural language description of the app features. Note that other non-GDPR-related features might be integrated with the generated features to build a complex app.
