A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements

With the rapid development of web-based services, concerns about user privacy have heightened. Website privacy policies, which serve as legal agreements between service providers and users, are hard for people to understand and therefore offer an opportunity for natural language processing. In this paper, we consider a corpus of these policies and tackle the problem of aligning, or grouping, policy segments according to the privacy issues they address. We use a dataset of pairwise human judgments to evaluate two methods, one based on clustering and the other on a hidden Markov model. Our analysis suggests a five-point gap between system and median-human levels of agreement with a consensus annotation; about half of this gap can be closed with bag-of-words representations, and the other half requires more sophisticated methods.
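
To make the idea of grouping segments with bag-of-words representations concrete, here is a minimal sketch in Python. It is not the clustering method or the hidden Markov model evaluated in the paper: it swaps in a simple greedy, threshold-based grouping over cosine similarity. The function names, the example segments, and the threshold of 0.3 are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_segments(segments, threshold=0.3):
    """Greedy single-pass grouping (an illustrative stand-in, not the paper's method).

    Each segment joins the first group whose first member is at least
    `threshold` similar to it; otherwise it starts a new group.
    Returns a list of groups, each a list of segment indices.
    """
    vectors = [bag_of_words(s) for s in segments]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical policy segments, invented for this example.
segments = [
    "We use cookies to remember your preferences.",
    "Cookies help us remember your settings and preferences.",
    "We may share your information with third parties.",
    "Your information may be shared with third parties such as advertisers.",
]
groups = group_segments(segments)
print(groups)
```

With these four segments, the two cookie-related sentences fall into one group and the two data-sharing sentences into another. A purely lexical approach like this can only pair segments that share vocabulary, which is consistent with the abstract's observation that bag-of-words representations close only part of the gap to human agreement.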
