dpUGC: Learn Differentially Private Representation for User Generated Contents

This paper first proposes a simple yet efficient generalized approach for applying differential privacy to text representations (i.e., word embeddings). Building on it, we propose a user-level approach for learning a personalized differentially private word embedding model from user generated content (UGC). To the best of our knowledge, this is the first work on learning a user-level differentially private word embedding model from text for sharing. The proposed approaches protect individuals from re-identification and, in particular, provide a better trade-off between privacy and data utility when sharing UGC data. The experimental results show that the trained embedding models are applicable to classic text analysis tasks (e.g., regression). Moreover, the proposed approaches for learning differentially private embedding models are both framework- and data-independent, which facilitates deployment and sharing. The source code is available at this https URL.
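
The abstract does not spell out the mechanism, so the following is only a minimal sketch of one common way to obtain user-level differential privacy during training: clip each user's update to a bounded L2 norm, sum the clipped updates, and add Gaussian noise scaled to that bound. This clip-and-noise pattern is an assumption made for illustration, not necessarily the authors' method; the function names, the `noise_multiplier` parameter, and the list-of-vectors input format are all hypothetical.

```python
import math
import random


def clip_by_l2_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def noisy_user_average(user_updates, max_norm, noise_multiplier, rng):
    """Average per-user updates after clipping, with Gaussian noise added.

    user_updates: one update vector per user (hypothetical input format).
    Clipping bounds how much any single user can change the sum, and the
    noise standard deviation is proportional to that bound.
    """
    clipped = [clip_by_l2_norm(u, max_norm) for u in user_updates]
    n = len(clipped)
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    updates = [[3.0, 4.0], [0.0, 0.5], [-1.0, 2.0]]
    print(noisy_user_average(updates, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

A larger `noise_multiplier` gives stronger privacy at the cost of noisier embeddings, which is the privacy and utility trade-off the abstract refers to. Converting a given noise level into a concrete privacy budget requires a separate accounting step that this sketch leaves out.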
