A Differentially Private Text Perturbation Method Using Regularized Mahalanobis Metric

Balancing the privacy-utility tradeoff is a crucial requirement of many practical machine learning systems that deal with sensitive customer data. A popular approach to privacy-preserving text analysis is noise injection, in which text data is first mapped into a continuous embedding space, perturbed by sampling spherical noise from an appropriate distribution, and then projected back to the discrete vocabulary space. While this perturbation satisfies the required metric differential privacy, the utility of downstream tasks modeled on the perturbed data is often low because spherical noise does not account for the variability in density around different words in the embedding space. In particular, words in a sparse region are likely to remain unchanged even when the noise scale is large. Using the global sensitivity of the mechanism can add too much noise to words in the dense regions of the embedding space, causing a high utility loss, whereas using local sensitivity can leak information through the scale of the noise added. In this paper, we propose a text perturbation mechanism based on a carefully designed regularized variant of the Mahalanobis metric to overcome this problem. For any given noise scale, this metric adds elliptical noise that accounts for the covariance structure of the embedding space. This heterogeneity in the noise scale along different directions helps ensure that words in sparse regions have a sufficient likelihood of replacement without sacrificing overall utility. We provide a text-perturbation algorithm based on this metric and formally prove its privacy guarantees. Additionally, we empirically show that, at the same level of utility, our mechanism achieves better privacy statistics than the state-of-the-art Laplace mechanism.
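The embed, perturb, and project pipeline described above can be illustrated with a minimal sketch. It is written from the abstract alone and is not the authors' implementation. The function names, the mixing parameter `lam`, the trace normalization of the covariance, and the particular sampler are all assumptions. The sampler draws a uniform direction and a Gamma-distributed radius, then stretches the result by the square root of the regularized covariance; this is one standard way to sample noise with density proportional to exp(-epsilon * ||z||) under a Mahalanobis-type norm.

```python
import numpy as np


def regularized_covariance(embeddings, lam):
    """Blend the embedding covariance with the identity (assumed form).

    lam = 0 gives the identity (spherical noise); lam = 1 gives the
    plain, trace-normalized covariance of the embedding matrix.
    """
    m = embeddings.shape[1]
    cov = np.cov(embeddings, rowvar=False)
    cov = cov * (m / np.trace(cov))  # assumption: rescale so the trace equals m
    return lam * cov + (1.0 - lam) * np.eye(m)


def sample_elliptical_noise(sigma_reg, epsilon, rng):
    """Sample z with density proportional to exp(-epsilon * sqrt(z^T sigma_reg^{-1} z))."""
    m = sigma_reg.shape[0]
    # Uniform direction on the unit sphere.
    direction = rng.standard_normal(m)
    direction /= np.linalg.norm(direction)
    # Radius of a spherical exp(-epsilon * ||u||) density in m dimensions.
    radius = rng.gamma(shape=m, scale=1.0 / epsilon)
    # Symmetric matrix square root via eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(sigma_reg)
    sqrt_sigma = (eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T
    return radius * (sqrt_sigma @ direction)


def perturb_word(word, vocab, embeddings, sigma_reg, epsilon, rng):
    """Add elliptical noise to a word vector and return the nearest vocabulary word."""
    idx = vocab.index(word)
    noisy = embeddings[idx] + sample_elliptical_noise(sigma_reg, epsilon, rng)
    nearest = np.argmin(np.linalg.norm(embeddings - noisy, axis=1))
    return vocab[nearest]


# Toy usage with random vectors standing in for real word embeddings.
rng = np.random.default_rng(0)
toy_vocab = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
toy_embeddings = rng.standard_normal((len(toy_vocab), 3))
toy_sigma = regularized_covariance(toy_embeddings, lam=0.5)
replacement = perturb_word("alpha", toy_vocab, toy_embeddings, toy_sigma, 5.0, rng)
```

In this sketch, setting `lam` to 0 recovers spherical noise, while larger values stretch the noise along directions of high variance in the embedding space, which is the intuition the abstract gives for making replacement more likely for words in sparse regions.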
