Toward Distribution Estimation under Local Differential Privacy with Small Samples

Abstract A number of studies have recently been made on discrete distribution estimation in the local model, in which users obfuscate their personal data (e.g., location, response in a survey) by themselves and a data collector estimates a distribution of the original personal data from the obfuscated data. Unlike the centralized model, in which a trusted database administrator can access all users’ personal data, the local model does not suffer from the risk of data leakage. A representative privacy metric in this model is LDP (Local Differential Privacy), which controls the amount of information leakage by a parameter ∈ called privacy budget. When ∈ is small, a large amount of noise is added to the personal data, and therefore users’ privacy is strongly protected. However, when the number of users ℕ is small (e.g., a small-scale enterprise may not be able to collect large samples) or when most users adopt a small value of ∈, the estimation of the distribution becomes a very challenging task. The goal of this paper is to accurately estimate the distribution in the cases explained above. To achieve this goal, we focus on the EM (Expectation-Maximization) reconstruction method, which is a state-of-the-art statistical inference method, and propose a method to correct its estimation error (i.e., difference between the estimate and the true value) using the theory of Rilstone et al. We prove that the proposed method reduces the MSE (Mean Square Error) under some assumptions.We also evaluate the proposed method using three largescale datasets, two of which contain location data while the other contains census data. The results show that the proposed method significantly outperforms the EM reconstruction method in all of the datasets when ℕ or ∈ is small.

[1]  Chris Clifton,et al.  Privacy-Preserving Data Mining , 2006, Encyclopedia of Database Systems.

[2]  Yang Zhang,et al.  CarTel: a distributed mobile sensor computing system , 2006, SenSys '06.

[3]  Úlfar Erlingsson,et al.  RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response , 2014, CCS.

[4]  Kôiti Hasida,et al.  Inferring Long-term User Properties Based on Users' Location History , 2007, IJCAI.

[5]  Michael Gastpar,et al.  Locally differentially-private distribution estimation , 2016, 2016 IEEE International Symposium on Information Theory (ISIT).

[6]  Aman Ullah,et al.  The second-order bias and mean squared error of nonlinear estimators , 1996 .

[7]  Rinku Dewri,et al.  Location Privacy for Rank-based Geo-Query Systems , 2017, Proc. Priv. Enhancing Technol..

[8]  Shashi Shekhar,et al.  Spatial big-data challenges intersecting mobility and cloud computing , 2012, MobiDE '12.

[9]  Xiao-Li Meng,et al.  POSTERIOR PREDICTIVE ASSESSMENT OF MODEL FITNESS VIA REALIZED DISCREPANCIES , 1996 .

[10]  Peter Kairouz,et al.  Discrete Distribution Estimation under Local Privacy , 2016, ICML.

[11]  Hans-Peter Kriegel,et al.  Generalized Outlier Detection with Flexible Kernel Density Estimates , 2014, SDM.

[12]  Viktor K. Prasanna,et al.  Big data analytics for demand response: Clustering over space and time , 2015, 2015 IEEE International Conference on Big Data (Big Data).

[13]  Catuscia Palamidessi,et al.  Geo-indistinguishability: differential privacy for location-based systems , 2012, CCS.

[14]  Emiliano De Cristofaro,et al.  What Does The Crowd Say About You? Evaluating Aggregation-based Location Privacy , 2017, Proc. Priv. Enhancing Technol..

[15]  Miguel Á. Carreira-Perpiñán,et al.  Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application , 2013, ArXiv.

[16]  C. W. Groetsch,et al.  The theory of Tikhonov regularization for Fredholm equations of the first kind , 1984 .

[17]  Pramod Viswanath,et al.  Extremal Mechanisms for Local Differential Privacy , 2014, J. Mach. Learn. Res..

[18]  Aaron Roth,et al.  The Algorithmic Foundations of Differential Privacy , 2014, Found. Trends Theor. Comput. Sci..

[19]  David Lazer,et al.  Inferring friendship network structure by using mobile phone data , 2009, Proceedings of the National Academy of Sciences.

[20]  Liam McNamara,et al.  SpotME If You Can: Randomized Responses for Location Obfuscation on Mobile Phones , 2011, 2011 31st International Conference on Distributed Computing Systems.

[21]  A. V. D. Vaart Asymptotic Statistics: Delta Method , 1998 .

[22]  Daqing Zhang,et al.  Participatory Cultural Mapping Based on Collective Behavior Data in Location-Based Social Networks , 2016, ACM Trans. Intell. Syst. Technol..

[23]  Andreas Haeberlen,et al.  Differential Privacy: An Economic Method for Choosing Epsilon , 2014, 2014 IEEE 27th Computer Security Foundations Symposium.

[24]  Ninghui Li,et al.  Differential Privacy: From Theory to Practice , 2016, Differential Privacy.

[25]  Yasuhiro Hayashi,et al.  A Versatile Clustering Method for Electricity Consumption Pattern Analysis in Households , 2013, IEEE Transactions on Smart Grid.

[26]  Catuscia Palamidessi,et al.  Optimal Geo-Indistinguishable Mechanisms for Location Privacy , 2014, CCS.

[27]  Stephen B. Wicker,et al.  Inferring Personal Information from Demand-Response Systems , 2010, IEEE Security & Privacy.

[28]  Yin Yang,et al.  Heavy Hitter Estimation over Set-Valued Data with Local Differential Privacy , 2016, CCS.

[29]  Jianhua Lin,et al.  Divergence measures based on the Shannon entropy , 1991, IEEE Trans. Inf. Theory.

[30]  Thomas M. Cover,et al.  Elements of Information Theory: Cover/Elements of Information Theory, Second Edition , 2005 .

[31]  Wenliang Du,et al.  OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[32]  Analía Amandi,et al.  Intelligent User Profiling , 2009, Artificial Intelligence: An International Perspective.

[33]  Ting Yu,et al.  Conservative or liberal? Personalized differential privacy , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[34]  Ernesto Damiani,et al.  A Discussion of Privacy Challenges in User Profiling with Big Data Techniques: The EEXCESS Use Case , 2013, 2013 IEEE International Congress on Big Data.

[35]  Martin J. Wainwright,et al.  Local privacy and statistical minimax rates , 2013, 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[36]  G. Kitagawa,et al.  Bootstrapping Log Likelihood and EIC, an Extension of AIC , 1997 .

[37]  Thomas J. McAvoy,et al.  Contemplative stance for chemical process control : An IFAC report , 1992, at - Automatisierungstechnik.

[38]  Yong Bao,et al.  The Second-Order Bias and Mean Squared Error of Estimators in Time Series Models , 2007 .

[39]  S L Warner,et al.  Randomized response: a survey technique for eliminating evasive answer bias. , 1965, Journal of the American Statistical Association.

[40]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[41]  Cynthia Dwork,et al.  Differential Privacy , 2006, ICALP.

[42]  Ramakrishnan Srikant,et al.  Privacy preserving OLAP , 2005, SIGMOD '05.

[43]  Tor Arne Johansen,et al.  On Tikhonov regularization, bias and variance in nonlinear system identification , 1997, Autom..

[44]  Catuscia Palamidessi,et al.  Efficient Utility Improvement for Location Privacy , 2017, Proc. Priv. Enhancing Technol..

[45]  Akihiko Ohsuga,et al.  Differential Private Data Collection and Analysis Based on Randomized Multiple Dummies for Untrusted Mobile Crowdsensing , 2017, IEEE Transactions on Information Forensics and Security.

[46]  Reza Shokri,et al.  Evaluating the Privacy Risk of Location-Based Services , 2011, Financial Cryptography.

[47]  Yoshihide Sekimoto,et al.  PFlow: Reconstructing People Flow Recycling Large-Scale Social Survey Data , 2011, IEEE Pervasive Computing.

[48]  Rakesh Agrawal,et al.  Privacy-preserving data mining , 2000, SIGMOD 2000.

[49]  Xing Xie,et al.  Mining interesting locations and travel sequences from GPS trajectories , 2009, WWW '09.

[50]  Charu C. Aggarwal,et al.  On the design and quantification of privacy preserving data mining algorithms , 2001, PODS.