Unsupervised Topic Extraction from Privacy Policies

This paper proposes automatic topic modeling of large-scale corpora of privacy policies using unsupervised learning techniques. Unsupervised learning offers several advantages for this task: a new corpus can be analyzed with a fraction of the effort required by supervised learning, changes in topics of interest can be studied over time, and finer-grained topics of interest can be identified in these privacy policies. Based on general principles of document analysis, we synthesize a cohesive framework for privacy policy topic modeling and apply it to a corpus of 4,982 privacy policies of mobile applications crawled from the Google Play Store. The results demonstrate that even with this moderate-size corpus, quite comprehensive insights can be gained into the focus and scope of current privacy policy documents. The extracted topics, their structure, and the applicability of the unsupervised approach are validated through an extensive comparison with findings reported in prior work that uses supervised learning (which depends heavily on manual annotation by experts). The comparison suggests a substantial overlap between the topics found and those reported in prior work, and also reveals some new topics of interest.
