Random projection-based multiplicative data perturbation for privacy preserving distributed data mining

This paper explores the use of multiplicative random projection matrices for privacy-preserving distributed data mining. It specifically considers the problem of computing statistical aggregates, such as the inner product matrix, correlation coefficient matrix, and Euclidean distance matrix, from distributed privacy-sensitive data possibly owned by multiple parties. This class of problems is directly related to many other data mining problems such as clustering, principal component analysis, and classification. The paper makes two primary contributions. First, it explores independent component analysis as a possible tool for breaching privacy in deterministic multiplicative perturbation-based models such as random orthogonal transformation and random rotation. Second, it proposes an approximate random projection-based technique to improve the level of privacy protection while still preserving certain statistical characteristics of the data. The paper presents extensive theoretical analysis and experimental results. Experiments demonstrate that the proposed technique is effective and can be successfully used for different types of privacy-preserving data mining applications.
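
The idea that a random projection can hide the raw values while approximately preserving distance-based statistics can be illustrated with a minimal sketch. It is not the paper's implementation: the Gaussian entry distribution, the 1/sqrt(k) scaling, the dimensions, and the helper names make_projection and sq_dists are all assumptions chosen for illustration. Records are stored as columns, and the projection reduces m attributes to k.

```python
import numpy as np

def make_projection(k, m, rng):
    # k x m matrix of i.i.d. N(0, 1) entries (an assumed choice of distribution),
    # scaled by 1/sqrt(k) so that E[R.T @ R] is the m x m identity.
    return rng.normal(0.0, 1.0, size=(k, m)) / np.sqrt(k)

def sq_dists(a):
    # Squared Euclidean distances between the columns (records) of a.
    g = a.T @ a
    d = np.diag(g)
    return d[:, None] + d[None, :] - 2.0 * g

rng = np.random.default_rng(0)
m, k, n = 2000, 1000, 5          # attributes, projected dimension, records
x = rng.normal(size=(m, n))      # hypothetical private data
r = make_projection(k, m, rng)   # kept secret by the data owner
u = r @ x                        # perturbed data that would be released

iu = np.triu_indices(n, 1)       # index pairs of distinct records
true_d = sq_dists(x)[iu]
approx_d = sq_dists(u)[iu]
rel_err = np.abs(approx_d - true_d) / true_d
print("max relative error in squared distance:", rel_err.max())
```

Under these assumptions the inner product matrix u.T @ u equals x.T @ x in expectation, and the error shrinks as k grows, so the choice of k trades accuracy against how much of the original data is discarded.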
