Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though users care about their digital privacy, they often do not read privacy policies, since doing so requires a significant investment of time and effort. Although natural language processing can help in understanding privacy policies, there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify them. We therefore create PrivaSeer, a corpus of over one million English-language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline that consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus, report results from readability tests, document similarity analysis, and keyphrase extraction, and explore the corpus through topic modeling.
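
The duplicate and near-duplicate removal step of such a pipeline can be illustrated with a small sketch. The abstract does not say which technique PrivaSeer uses for this step, so the approach below (exact matching by hash after normalization, then Jaccard similarity over word shingles) is an assumption chosen for simplicity. The function names, the shingle size `k=3`, and the threshold `0.8` are all hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, k=3):
    """Return the set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8):
    """Keep the first copy of each document; drop exact and near duplicates.

    Assumed illustrative method, not the paper's. It compares every document
    against all kept documents, so it only suits small inputs.
    """
    seen_hashes = set()
    kept = []  # list of (document, shingle set) pairs
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        shingle_set = shingles(doc)
        if any(jaccard(shingle_set, other) >= threshold for _, other in kept):
            continue  # near duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append((doc, shingle_set))
    return [doc for doc, _ in kept]
```

A pairwise comparison like this would not scale to a million documents; at that size a hashing-based scheme that avoids comparing every pair would be needed.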
