Analysis of and techniques for privacy preserving data mining

Privacy is often regarded as a social, moral, or legal concept. As the Internet and e-commerce have prospered, privacy has become one of the most important issues in IT and has received increasing attention from enterprises, consumers, and legislators. Although various techniques have been developed, such as randomization-based methods, cryptographic methods, and database inference control, many key problems remain open in this area. In particular, new privacy and security issues have been identified, and the scope of privacy has expanded. An essential problem in this context is the tradeoff between data utility and disclosure risk. Since previous research conducted only empirical evaluations or limited analysis of existing randomization techniques, a more solid theoretical analysis is needed. This dissertation investigates different perturbation models in randomization-based privacy-preserving data mining, of which the additive-noise-based model and the projection-based model are the primary tools. For additive-noise-based perturbation, the explicit relation between noise and mining accuracy has not been carefully studied. We first propose an improved strategy for reconstructing the data, based on the representative method. We then derive explicit bounds on the reconstruction error; the upper and lower bounds together provide a guideline for balancing the privacy/accuracy tradeoff. We also discuss other potential threats to privacy, using a measure that we define for quantifying privacy. For projection-based perturbation, the properties of different models and the possible disclosures within those models are analyzed in detail. In particular, we propose an A-priori Knowledge-based ICA attack (AK-ICA), which is effective against all the existing projection models. Because of the vulnerabilities in previous randomization models, a general-location-model-based approach is proposed.
It first builds a statistical model that fits real data containing both categorical and numerical variables, then generates a synthetic data set for mining by tuning the parameters of the model rather than perturbing individual values. Since the search space of the model parameters is much smaller than that of the data, and all the information that attackers can derive is contained in those parameters, this approach is expected to be both more effective and more efficient. This dissertation investigates the privacy issues of the numerical data in this model, analyzing and controlling disclosure in different scenarios.
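
The model-then-synthesize idea can be illustrated with a minimal sketch. It assumes the textbook form of the general location model: the categorical variables define cells with multinomial probabilities, and within each cell the numerical variables are multivariate normal with a cell-specific mean and a covariance shared across cells. The function names `fit_glm` and `sample_glm` and the toy data are hypothetical, and the sketch only fits the parameters and samples from them; it does not reproduce the parameter tuning or disclosure control that the dissertation develops.

```python
import numpy as np

def fit_glm(cells, Z):
    """Fit a general location model (illustrative, not the thesis procedure).

    cells: (n,) integer array, the cell index built from the categorical variables.
    Z:     (n, d) float array of numerical variables.
    Returns cell labels, cell probabilities, per-cell means, pooled covariance.
    """
    labels, counts = np.unique(cells, return_counts=True)
    probs = counts / counts.sum()
    means = np.array([Z[cells == c].mean(axis=0) for c in labels])
    # Residuals from each record's own cell mean; labels is sorted, so
    # searchsorted maps every cell value to its row in `means`.
    resid = Z - means[np.searchsorted(labels, cells)]
    cov = resid.T @ resid / (len(Z) - len(labels))
    return labels, probs, means, cov

def sample_glm(params, n, rng):
    """Draw n synthetic records from fitted parameters."""
    labels, probs, means, cov = params
    idx = rng.choice(len(labels), size=n, p=probs)
    noise = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n)
    return labels[idx], means[idx] + noise

# Toy data (assumed for illustration): 3 cells, 2 numerical variables.
rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [3.0, 1.0], [-2.0, 4.0]])
true_cov = np.array([[1.0, 0.5], [0.5, 2.0]])
cells = rng.choice(3, size=3000, p=[0.5, 0.3, 0.2])
Z = true_means[cells] + rng.multivariate_normal([0.0, 0.0], true_cov, size=3000)

params = fit_glm(cells, Z)
syn_cells, syn_Z = sample_glm(params, 3000, rng)
```

Only `params` (the cell probabilities, cell means, and shared covariance) feeds the synthetic release, which is why disclosure analysis in this setting can focus on the parameters rather than on individual records.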
