Privacy concerns over the growing collection of personal information by institutions over the internet have led to the development of data mining algorithms that preserve the privacy of the individuals whose data are collected and analyzed. A novel approach to such privacy-preserving data mining was proposed in which each datum in a data set is perturbed by adding a random value drawn from a known distribution. In these applications the distribution of the original data set is important, and estimating it is one of the goals of the data mining algorithm. The distribution is estimated with an iterative algorithm such as Expectation Maximization (EM), which was shown to have desirable properties such as low privacy loss and high-fidelity estimates of the distribution. However, each EM iteration requires computation proportional to the size of the data set, so estimating the distribution can take a long time. In this paper we propose two ways to reduce the amount of computation. First, we show that the problem can be recast as a deconvolution problem, so that signal processing algorithms can be applied to solve it. In particular, we consider both a direct method and iterative methods, which are more robust against noise and ill-conditioning. We show that the Richardson-Lucy deblurring algorithm is equivalent to EM after quantization. The signal processing approach also shows how the choice of perturbation affects information loss and privacy loss, and allows us to clarify some points made in the literature. Second, we propose a scheme for perturbing data that also offers arbitrarily small privacy loss and arbitrarily high fidelity in the estimate. The main advantage of the proposed scheme is the simplicity of its estimation algorithm: in contrast to iterative algorithms such as EM, it estimates the unknown distribution in one step. This is significant in applications where the data set is very large or where the data mining algorithm runs in an online environment.
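
The deconvolution view can be illustrated with a small sketch. It assumes the data are quantized into bins, so that the histogram of the perturbed values is approximately the discrete convolution of the unknown original distribution with the known perturbation distribution, and it applies the standard multiplicative Richardson-Lucy update to undo that convolution. The function names (`convolve`, `richardson_lucy`), the three-bin distribution `true_p`, the uniform perturbation `noise`, the sample size, and the iteration count are all illustrative assumptions, not values or code taken from the paper.

```python
import random


def convolve(p, q):
    """Full discrete convolution of two probability vectors."""
    out = [0.0] * (len(p) + len(q) - 1)
    for j, pj in enumerate(p):
        for m, qm in enumerate(q):
            out[j + m] += pj * qm
    return out


def richardson_lucy(observed, noise, n_bins, iterations=200):
    """Estimate the original bin probabilities from the perturbed histogram.

    observed: normalized histogram of perturbed values,
              of length n_bins + len(noise) - 1
    noise:    known perturbation probabilities (assumed to sum to 1)
    """
    estimate = [1.0 / n_bins] * n_bins  # flat starting guess
    for _ in range(iterations):
        predicted = convolve(estimate, noise)
        ratio = [o / pr if pr > 0 else 0.0
                 for o, pr in zip(observed, predicted)]
        # Multiplicative update: correlate the ratio with the noise kernel.
        estimate = [
            e * sum(noise[m] * ratio[j + m] for m in range(len(noise)))
            for j, e in enumerate(estimate)
        ]
    return estimate


# Hypothetical example: 3 original bins, uniform perturbation over 4 offsets.
rng = random.Random(0)
true_p = [0.1, 0.6, 0.3]
noise = [0.25, 0.25, 0.25, 0.25]
n_samples = 20000

counts = [0] * (len(true_p) + len(noise) - 1)
for _ in range(n_samples):
    x = rng.choices(range(len(true_p)), weights=true_p)[0]  # original datum
    y = rng.choices(range(len(noise)), weights=noise)[0]    # added perturbation
    counts[x + y] += 1                                      # only x + y is released

observed = [c / n_samples for c in counts]
estimate = richardson_lucy(observed, noise, n_bins=len(true_p))
print(estimate)
```

Each pass of this update works on the histogram bins rather than on the individual records, which is where the quantized formulation can save computation relative to an update over every datum. The update also keeps the estimate nonnegative and, because `noise` sums to one, keeps its total mass equal to that of `observed`.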