PrivaSeer: A Privacy Policy Search Engine

Organisations use web privacy policies to disclose their privacy practices to users. However, users often do not read these policies because they are too long, too time-consuming, or too complicated. Attempts to simplify privacy policies using natural language processing have achieved some success, but they face limitations of scalability and generalization. While this puts the onus on researchers and policy regulators to protect users against unfair privacy practices, they often lack a large-scale collection of policies with which to study the state of internet privacy. To remedy this bottleneck, we present PrivaSeer, the first privacy policy search engine. PrivaSeer indexes 1,400,318 English-language website privacy policies and supports searching them with text queries and several search facets. Results can be ranked by PageRank, query-based document relevance, and the probability that a document is a privacy policy. Results can also be filtered by readability, vagueness, industry, and mentions of tracking technology, self-regulatory bodies, or regulations and cross-border agreements in the policy text. PrivaSeer allows legal experts, researchers, and policy regulators to discover privacy trends and policy anomalies at scale. In this paper, we present the search interface, ranking technique, and filtering techniques for PrivaSeer. We create two indexes of privacy policies: one that includes the supplementary non-policy content present in privacy policy web pages and one that excludes it. We evaluate the functionality of PrivaSeer by comparing ranking techniques on these two indexes.
