Analyzing Privacy Policies at Scale

Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users rarely read privacy policies. Some recent efforts have aimed to use a combination of crowdsourcing, machine learning, and natural language processing to interpret privacy policies at scale, thus producing annotations for use in interfaces that inform Internet users of salient policy details. However, little attention has been devoted to studying the accuracy of crowdsourced privacy policy annotations, how crowdworker productivity can be enhanced for such a task, and the levels of granularity that are feasible for automatic analysis of privacy policies. In this article, we present a trajectory of work addressing each of these topics. We include analyses of crowdworker performance, an evaluation of a method to make a privacy-policy-oriented task easier for crowdworkers, a coarse-grained approach to labeling segments of policy text with descriptive themes, and a fine-grained approach to identifying user choices described in policy text. Together, the results from these efforts show the effectiveness of using automated and semi-automated methods for extracting from privacy policies the data practice details that are salient to Internet users’ interests.
