Intent Classification and Slot Filling for Privacy Policies

Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences in a privacy policy document explain privacy practices, and their constituent text spans convey further specific information about each practice. We refer to predicting the privacy practice explained in a sentence as intent classification and to identifying the text spans that convey specific information as slot filling. In this work, we propose PolicyIE, a corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. The PolicyIE corpus is a challenging benchmark with limited labeled examples, reflecting the cost of collecting large-scale annotations. We present two alternative neural approaches as baselines: (1) formulating intent classification and slot filling as a joint sequence tagging task and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. Experimental results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. Error analysis reveals the deficiencies of the baseline approaches, suggesting room for improvement in future work. We hope the PolicyIE corpus will stimulate future research in this domain.
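To make the two formulations concrete, the following minimal Python sketch shows how one annotated sentence could be encoded for each. It is an illustration under stated assumptions, not the corpus format or the baseline implementation: the example sentence, the intent name `data-collection`, the slot labels (`collector`, `action`, `data`), the BIO tagging scheme, and the bracketed target string are all hypothetical choices made here.

```python
from typing import List, Tuple

# A slot is (label, start_token, end_token_exclusive).
# Label names are hypothetical, chosen only for illustration.
Slot = Tuple[str, int, int]


def to_bio_tags(num_tokens: int, slots: List[Slot]) -> List[str]:
    """Sequence tagging view: one BIO tag per token."""
    tags = ["O"] * num_tokens
    for label, start, end in slots:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def to_seq2seq_target(tokens: List[str], intent: str, slots: List[Slot]) -> str:
    """Seq2Seq view: intent and slots linearized into one target string.

    The bracket format is an assumed linearization, not the paper's.
    """
    ordered = sorted(slots, key=lambda s: s[1])  # left-to-right by start index
    parts = [
        f"[{label}: {' '.join(tokens[start:end])}]"
        for label, start, end in ordered
    ]
    return f"[IN:{intent} " + " ".join(parts) + "]"


# Hypothetical annotated sentence.
tokens = "We collect your email address".split()
intent = "data-collection"
slots = [("collector", 0, 1), ("action", 1, 2), ("data", 2, 5)]

bio_tags = to_bio_tags(len(tokens), slots)
target = to_seq2seq_target(tokens, intent, slots)

print(intent, bio_tags)
print(target)
```

In the tagging view, a model predicts one sentence-level intent label plus one tag per token; in the Seq2Seq view, a model generates the whole target string, so intent and slots are decoded together.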
