Differential Privacy for Text Analytics via Natural Text Sanitization

Texts convey sophisticated knowledge, but they also carry sensitive information. Despite the success of general-purpose language models and domain-specific mechanisms with differential privacy (DP), existing text sanitization mechanisms still provide low utility, a consequence of the curse of dimensionality in high-dimensional text representations. The companion issue of utilizing sanitized texts for downstream analytics also remains under-explored. This paper takes a direct approach to text sanitization. Our insight is to account for both sensitivity and similarity via a new local DP notion. The sanitized texts also feed into our sanitization-aware pretraining and fine-tuning, enabling privacy-preserving natural language processing over the BERT language model with promising utility. Surprisingly, the high utility does not boost the success rate of inference attacks.
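The idea of sanitizing text while balancing sensitivity and similarity can be illustrated with a minimal sketch of token-level metric local DP (the vocabulary, the toy 2-D embeddings, and the function names below are our illustrative assumptions, not the paper's implementation; real systems would use pretrained word vectors): each word is replaced by a vocabulary word sampled via the exponential mechanism, with probability decaying in embedding distance, so semantically similar replacements are favored.

```python
import math
import random

# Hypothetical toy vocabulary with 2-D "embeddings"; a real deployment
# would use pretrained vectors (e.g., GloVe) over a full vocabulary.
EMBED = {
    "good": (1.0, 0.0),
    "great": (0.9, 0.1),
    "bad": (-1.0, 0.0),
    "terrible": (-0.9, -0.1),
}


def sanitize_token(token: str, epsilon: float, rng: random.Random) -> str:
    """Sample a replacement for `token` via the exponential mechanism.

    Each candidate word w gets unnormalized weight
    exp(-epsilon/2 * d(token, w)), where d is Euclidean distance in
    embedding space; this yields an epsilon-metric-LDP guarantee with
    respect to d. Larger epsilon keeps the token closer to the original.
    """
    x = EMBED[token]
    weights = {w: math.exp(-0.5 * epsilon * math.dist(x, v))
               for w, v in EMBED.items()}
    total = sum(weights.values())
    r = rng.uniform(0.0, total)
    acc = 0.0
    for w, s in weights.items():
        acc += s
        if r <= acc:
            return w
    return token  # numerical fallback


def sanitize_sentence(tokens, epsilon, rng):
    """Sanitize a sentence token by token (tokens independent under LDP)."""
    return [sanitize_token(t, epsilon, rng) for t in tokens]
```

With a large epsilon, "good" is usually kept or swapped for the nearby "great"; with a small epsilon, the output distribution flattens toward uniform over the vocabulary, trading utility for privacy.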
