Differential Privacy for Text Analytics via Natural Text Sanitization

Texts convey sophisticated knowledge, but they also carry sensitive information. Despite the success of general-purpose language models and domain-specific mechanisms with differential privacy (DP), existing text sanitization mechanisms still provide low utility, a consequence of the curse of dimensionality in high-dimensional text representations. The companion issue of utilizing sanitized texts for downstream analytics also remains under-explored. This paper takes a direct approach to text sanitization. Our insight is to account for both sensitivity and similarity via a new local DP notion. The sanitized texts also feed into our sanitization-aware pretraining and fine-tuning, enabling privacy-preserving natural language processing over the BERT language model with promising utility. Surprisingly, the high utility does not boost the success rate of inference attacks.
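The idea of sanitizing text while balancing sensitivity and similarity can be illustrated with a minimal sketch of token-level metric local DP (the vocabulary, the toy 2-D embeddings, and the function names below are our illustrative assumptions, not the paper's implementation; real systems would use pretrained word vectors): each word is replaced by a vocabulary word sampled via the exponential mechanism, with probability decaying in embedding distance, so semantically similar replacements are favored.

```python
import math
import random

# Hypothetical toy vocabulary with 2-D "embeddings"; a real deployment
# would use pretrained vectors (e.g., GloVe) over a full vocabulary.
EMBED = {
    "good": (1.0, 0.0),
    "great": (0.9, 0.1),
    "bad": (-1.0, 0.0),
    "terrible": (-0.9, -0.1),
}


def sanitize_token(token: str, epsilon: float, rng: random.Random) -> str:
    """Sample a replacement for `token` via the exponential mechanism.

    Each candidate word w gets unnormalized weight
    exp(-epsilon/2 * d(token, w)), where d is Euclidean distance in
    embedding space; this yields an epsilon-metric-LDP guarantee with
    respect to d. Larger epsilon keeps the token closer to the original.
    """
    x = EMBED[token]
    weights = {w: math.exp(-0.5 * epsilon * math.dist(x, v))
               for w, v in EMBED.items()}
    total = sum(weights.values())
    r = rng.uniform(0.0, total)
    acc = 0.0
    for w, s in weights.items():
        acc += s
        if r <= acc:
            return w
    return token  # numerical fallback


def sanitize_sentence(tokens, epsilon, rng):
    """Sanitize a sentence token by token (tokens independent under LDP)."""
    return [sanitize_token(t, epsilon, rng) for t in tokens]
```

With a large epsilon, "good" is usually kept or swapped for the nearby "great"; with a small epsilon, the output distribution flattens toward uniform over the vocabulary, trading utility for privacy.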
