A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ

Studies have shown that website privacy policies are too long and too hard for their target audience to comprehend. These studies, along with a more recent body of research that uses machine learning and natural language processing to automatically summarize privacy policies, benefit greatly from, if not rely on, corpora of privacy policies collected from the web. While smaller annotated corpora of web privacy policies have been made public, we are not aware of any large publicly available corpus. We use DMOZ, a massive open-content directory of the web, and its 1.5 million manually categorized websites to collect hundreds of thousands of privacy policies, each associated with the categories of its website, enabling research on privacy policies across different categories and market sectors. We review the statistics of this corpus and make it available for research. We also obtain valuable insights about privacy policies, e.g., which websites post them less often. Our corpus gives researchers a valuable tool for investigating privacy policies. For example, it provides a benchmark for comparing different methods of privacy policy summarization, and it can be used in unsupervised machine learning to summarize privacy policies.
