Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset

Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. Prior research, however, has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine. Using the crawler and natural language processing, we curated a dataset of 1,071,488 English-language privacy policies, spanning over two decades and over 130,000 distinct websites. Our analyses of the data show how the privacy policy landscape has changed over time and how websites have reacted to the evolving legal landscape, for example through the adoption of privacy seals and in response to new regulations such as the GDPR. Our results suggest that privacy policies underreport the presence of tracking technologies and third parties. We find that, over the last twenty years, privacy policies have more than doubled in length and that the median reading level, while already challenging, has increased modestly.
