Automatic Detection of Vague Words and Sentences in Privacy Policies

Website privacy policies represent the single most important source of information for users to gauge how their personal data are collected, used and shared by companies. However, privacy policies are often vague and people struggle to understand the content. Their opaqueness poses a significant challenge to both users and policy regulators. In this paper, we seek to identify vague content in privacy policies. We construct the first corpus of human-annotated vague words and sentences and present empirical studies on automatic vagueness detection. In particular, we investigate context-aware and context-agnostic models for predicting vague words, and explore auxiliary-classifier generative adversarial networks for characterizing sentence vagueness. Our experimental results demonstrate the effectiveness of proposed approaches. Finally, we provide suggestions for resolving vagueness and improving the usability of privacy policies.

[1]  Dale A. Thorpe,et al.  The sorites paradox , 1984, Synthese.

[2]  J. Reeve,et al.  Solutions to problematic polypharmacy: learning from the expertise of patients. , 2015, The British journal of general practice : the journal of the Royal College of General Practitioners.

[3]  Ye Zhang,et al.  A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification , 2015, IJCNLP.

[4]  Noah A. Smith,et al.  A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements , 2014, COLING.

[5]  Lorrie Faith Cranor,et al.  Disagreeable Privacy Policies: Mismatches between Meaning and Users’ Understanding , 2014 .

[6]  Kees van Deemter Not Exactly: In Praise of Vagueness , 2010 .

[7]  Travis D. Breaux,et al.  A Theory of Vagueness and Privacy Risk Perception , 2016, 2016 IEEE 24th International Requirements Engineering Conference (RE).

[8]  S. Shapiro Vagueness in Context , 2006 .

[9]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[10]  Lantao Yu,et al.  SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient , 2016, AAAI.

[11]  Norman M. Sadeh,et al.  Identifying the Provision of Choices in Privacy Policy Text , 2017, EMNLP.

[12]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[13]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[14]  Florian Matthes,et al.  SaToS: Assessing and Summarising Terms of Services from German Webshops , 2017, INLG.

[15]  Ian J. Goodfellow,et al.  NIPS 2016 Tutorial: Generative Adversarial Networks , 2016, ArXiv.

[16]  Eric Nichols,et al.  Named Entity Recognition with Bidirectional LSTM-CNNs , 2015, TACL.

[17]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[18]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[19]  Noah A. Smith,et al.  Crowdsourcing Annotations for Websites' Privacy Policies: Can It Really Work? , 2016, WWW.

[20]  David A. Norton Norton , 2000, The Lancet.

[21]  William Yang Wang,et al.  KBGAN: Adversarial Learning for Knowledge Graph Embeddings , 2017, NAACL.

[22]  Alexander M. Rush,et al.  LSTMVis: A Tool for Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks , 2016, IEEE Transactions on Visualization and Computer Graphics.

[23]  Noah A. Smith,et al.  Automatic Categorization of Privacy Policies: A Pilot Study , 2012 .

[24]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[25]  Alan Ritter,et al.  Adversarial Learning for Neural Dialogue Generation , 2017, EMNLP.

[26]  Noah A. Smith,et al.  The Usable Privacy Policy Project : Combining Crowdsourcing , Machine Learning and Natural Language Processing to Semi-Automatically Answer Those Privacy Questions Users Care About , 2014 .

[27]  R. Keefe Theories of vagueness , 2000 .

[28]  Noah A. Smith,et al.  Unsupervised Alignment of Privacy Policies using Hidden Markov Models , 2014, ACL.

[29]  Fei Liu,et al.  Modeling Language Vagueness in Privacy Policies using Deep Neural Networks , 2018, AAAI Fall Symposia.

[30]  Alexander M. Rush,et al.  Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks , 2016, ArXiv.

[31]  Frederick Liu,et al.  The Creation and Analysis of a Website Privacy Policy Corpus , 2016, ACL.

[32]  Aleecia M. McDonald,et al.  The Cost of Reading Privacy Policies , 2009 .