TEM: High Utility Metric Differential Privacy on Text

Ensuring the privacy of users whose data are used to train Natural Language Processing (NLP) models is necessary to build and maintain customer trust. Differential Privacy (DP) has emerged as the most successful method for protecting the privacy of individuals. However, applying DP to the NLP domain comes with unique challenges. The most successful previous methods use a generalization of DP to metric spaces and privatize text by adding noise to inputs in the metric space of word embeddings. However, these methods assume that one specific distance measure is being used, ignore the density of the space around the input, and assume that the embeddings were trained on non-sensitive data. In this work, we propose the Truncated Exponential Mechanism (TEM), a general method that allows the privatization of words using any distance metric, on embeddings that can be trained on sensitive data. Our method uses the exponential mechanism to turn the privatization step into a selection problem. This allows the noise to be calibrated to the density of the embedding space around the input, and makes domain adaptation possible for the embeddings. In our experiments, we demonstrate that our method significantly outperforms the state of the art in terms of utility at the same level of privacy, while providing more flexibility in the choice of metric.
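To make the selection idea concrete, below is a minimal sketch of exponential-mechanism word privatization over an embedding metric space. It is not the full TEM algorithm (in particular, it omits the truncation step the paper's name refers to), and the vocabulary, embeddings, and Euclidean distance are illustrative assumptions; any metric can be substituted on the indicated line.

```python
import numpy as np

def exponential_mechanism_word(word, vocab, embeddings, epsilon, rng=None):
    """Privatize `word` by sampling a replacement from `vocab`.

    Generic exponential mechanism over an embedding metric space:
    candidates closer to the input are exponentially more likely to be
    selected, so the effective noise adapts to how densely the space is
    populated around the input. Simplified sketch only; TEM additionally
    truncates the candidate set, which this code does not do.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = embeddings[word]
    # Utility of a candidate is its negative distance to the input;
    # swapping in cosine or any other metric changes only this line.
    distances = np.array([np.linalg.norm(x - embeddings[w]) for w in vocab])
    # Exponential mechanism: Pr[w'] proportional to exp(-epsilon * d(w, w') / 2).
    scores = -0.5 * epsilon * distances
    scores -= scores.max()          # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return rng.choice(vocab, p=probs)

# Toy usage with made-up 2-d "embeddings": "dog" is near "cat",
# so it is a far more likely replacement than "car".
vocab = ["cat", "dog", "car"]
embeddings = {"cat": np.array([0.0, 0.0]),
              "dog": np.array([0.1, 0.0]),
              "car": np.array([5.0, 5.0])}
print(exponential_mechanism_word("cat", vocab, embeddings, epsilon=4.0))
```

Note how the sampling probabilities depend directly on the distances to all candidates, which is what lets the mechanism account for the local density of the embedding space rather than adding input-independent noise.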
