Inferences on Kernel Density Estimates by Solving Nonlinear Systems

Kernel density estimators are a popular family of nonparametric estimators with applications in exploratory statistics and data mining. Because a kernel estimate is constructed directly from the data, when the data are sensitive only indirect representations of the estimate, such as graphs or tabulations, can be stored or transmitted. However, even such representations might contain enough information to allow the data to be reconstructed, which gives rise to an inference problem for kernel estimates. This inference problem can be described by a system of nonlinear equations that arises naturally from the kernel estimate of a multivariate dataset. The solution of the system is the dataset from which the kernel estimate was computed and, in practice, a good approximation to the solution is not available. Publicly available solvers for nonlinear systems could therefore pose a serious threat to data privacy. This paper investigates the numerical solution of the nonlinear systems arising from the kernel estimate of a multivariate dataset and shows that this task is challenging. In fact, the Jacobian matrix of the system is numerically singular, and many solvers for nonlinear equations fail because they must solve linear systems whose coefficient matrix is the Jacobian. Moreover, up-to-date solvers for optimization problems, which do not suffer from this drawback, may also fail to solve the nonlinear system. To show this, we tested a subspace trust-region method, a BFGS method, and a gradient projection method on both a synthetic and a real dataset. These methods can find a solution of the optimization problem even when started far from it. However, the experimental results on both datasets show that, unless the initial guess is very close to the solution, all three methods fail to converge to a solution of the system of equations. Therefore, unless a very good approximation of the solution is known, the dataset cannot be reconstructed using publicly available solvers.
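To make the nonlinear system concrete, the following is a minimal sketch rather than the setup used in the paper. It assumes a one-dimensional dataset, a Gaussian kernel, a fixed bandwidth `h`, and an estimate published as a tabulation on a grid `points`; all names and values are hypothetical. The unknowns are the data points, the residual `F(x)` is the difference between the estimate built from a candidate dataset and the published values, and the Jacobian is available in closed form, so its condition number can be inspected directly.

```python
import numpy as np

def kde(points, data, h):
    """Gaussian kernel estimate of 1-D `data`, evaluated at `points`."""
    u = (points[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def residual(x, points, values, h):
    """F(x) = kde(points; x) - values. A root x reproduces the published table."""
    return kde(points, x, h) - values

def jacobian(x, points, h):
    """Analytic Jacobian, J[j, i] = dF_j / dx_i, for the Gaussian kernel."""
    u = (points[:, None] - x[None, :]) / h
    return u * np.exp(-0.5 * u**2) / (x.size * h**2 * np.sqrt(2.0 * np.pi))

# Hypothetical example: 8 "sensitive" values and an 8-point published grid.
rng = np.random.default_rng(0)
data = rng.normal(size=8)
h = 0.5
points = np.linspace(-3.0, 3.0, 8)
values = kde(points, data, h)        # what an attacker would observe

F_true = residual(data, points, values, h)   # zero at the true dataset
J_true = jacobian(data, points, h)

# A large condition number signals the kind of ill-conditioning that
# troubles Newton-type solvers, which solve linear systems with J.
print("cond(J) at the true data:", np.linalg.cond(J_true))
```

Note that the system is invariant under any reordering of the unknowns, so the dataset can be recovered at best up to a permutation of its points.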
