Natural Language Processing for Mobile App Privacy Compliance

Many Internet services collect large amounts of data from their users. Privacy policies are intended to describe the services’ privacy practices. However, due to their length and complexity, reading privacy policies is a challenge for end users, government regulators, and companies. Natural language processing holds the promise of helping address this challenge. Specifically, we focus on comparing the practices described in privacy policies to the practices performed by the smartphone apps covered by those policies. Government regulators are interested in comparing apps to their privacy policies in order to detect non-compliance with laws, and companies are interested for the same reason. We frame the identification of privacy practice statements in privacy policies as a classification problem, which we address with a three-tiered approach: a privacy practice statement is classified based on a data type (e.g., location), party (i.e., first or third party), and modality (i.e., whether a practice is explicitly described as being performed or not performed). Privacy policies omit discussion of many practices. With negative F1 scores ranging from 78% to 100%, the performance results of this three-tiered classification methodology suggest an improvement over the state of the art. Our NLP analysis of privacy policies is an integral part of our Mobile App Privacy System (MAPS), which we used to analyze 1,035,853 free apps on the Google Play Store. Potential compliance issues appeared to be widespread, and those involving third parties were particularly common.
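
To make the three tiers concrete, the following is a minimal sketch of how a classified policy statement might be represented and compared against observed app behavior. It is an illustration under stated assumptions, not the MAPS implementation: the names `Practice`, `PolicyStatement`, and `potential_issues` are hypothetical, and the flagging rule (an observed practice that the policy does not describe as performed) is an assumed simplification.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Practice:
    """A privacy practice: the first two tiers (hypothetical structure)."""
    data_type: str  # tier 1, e.g. "location"
    party: str      # tier 2, "first" or "third"


@dataclass(frozen=True)
class PolicyStatement:
    """A classified policy statement: a practice plus its modality."""
    practice: Practice
    performed: bool  # tier 3: True = described as performed,
                     #         False = described as not performed


def potential_issues(statements: Iterable[PolicyStatement],
                     observed: Set[Practice]) -> List[Practice]:
    """Return observed practices the policy does not describe as performed.

    Assumed rule for illustration only: a practice is flagged when the app
    exhibits it but the policy either omits it or says it is not performed.
    """
    disclosed = {s.practice for s in statements if s.performed}
    flagged = observed - disclosed
    return sorted(flagged, key=lambda p: (p.data_type, p.party))


# Hypothetical example data, not taken from the study.
statements = [
    PolicyStatement(Practice("location", "first"), performed=True),
    PolicyStatement(Practice("location", "third"), performed=False),
]
observed = {
    Practice("location", "first"),
    Practice("location", "third"),
    Practice("contact", "third"),
}
issues = potential_issues(statements, observed)
```

In this toy example, the first-party location practice is disclosed, while the two third-party practices are flagged: one is explicitly denied by the policy and the other is not mentioned at all.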
