On Privacy Protection of Latent Dirichlet Allocation Model Training

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering the hidden semantic structure of text corpora, and it plays a fundamental role in many machine learning applications. However, like many other machine learning algorithms, training an LDA model may leak sensitive information about the training data and pose significant privacy risks. To mitigate these risks, this paper studies privacy-preserving algorithms for LDA model training. In particular, we first develop a privacy monitoring algorithm to quantify the privacy guarantee provided by the inherent randomness of the Collapsed Gibbs Sampling (CGS) process in a typical LDA training algorithm on centrally curated datasets. We then propose a locally private LDA training algorithm for crowdsourced data that provides local differential privacy for individual data contributors. Experimental results on real-world datasets demonstrate the effectiveness of the proposed algorithms.
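The abstract refers to two standard building blocks that can be sketched concretely: a collapsed Gibbs sampling sweep for LDA (the source of the inherent randomness whose privacy guarantee is monitored) and a local perturbation mechanism satisfying local differential privacy. The Python sketch below shows a textbook CGS update and a generic k-ary randomized response; the function names, count-table layout, and the choice of randomized response are illustrative assumptions, not the authors' actual mechanisms.

import numpy as np

def cgs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    # One standard collapsed Gibbs sweep over all tokens.
    # docs: list of word-id lists; z: matching list of topic assignments.
    # n_dk[d, k]: topic counts per document; n_kw[k, w]: word counts per topic;
    # n_k[k]: total word count per topic; alpha, beta: Dirichlet hyperparameters.
    K, V = n_kw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current assignment from the count tables.
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # Full conditional p(z = k | rest), up to normalization.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            p /= p.sum()
            # The topic is resampled at random -- this sampling step is the
            # inherent randomness whose privacy guarantee the paper studies.
            k_new = rng.choice(K, p=p)
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
    return z

def randomized_response(word_id, V, epsilon, rng):
    # Generic k-ary randomized response over a vocabulary of size V,
    # a standard epsilon-LDP primitive shown only to illustrate local
    # perturbation of a contributor's token before it leaves the device.
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + V - 1)
    if rng.random() < p_keep:
        return word_id
    # Otherwise report a uniformly random different word id.
    other = int(rng.integers(V - 1))
    return other if other < word_id else other + 1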
