A privacy preserving technique for distance-based classification with worst case privacy guarantees

There has been relatively little work on privacy-preserving techniques for distance-based mining. The most widely used are additive perturbation methods and orthogonal-transform-based methods. These methods concentrate on privacy protection in the average case and provide no worst-case privacy guarantee. The lack of such a guarantee makes these techniques difficult to use in practice and leaves them open to privacy breaches under certain attacks. This paper proposes a novel privacy protection method for distance-based mining algorithms that gives worst-case privacy guarantees and protects the data against correlation-based and transform-based attacks. The method has three novel aspects. First, it uses a framework that provides a theoretical bound on the privacy breach in the worst case. The framework provides easy-to-check conditions for determining whether a method provides a worst-case guarantee. A quick examination shows that special types of noise, such as Laplace noise, provide a worst-case guarantee, while most existing methods, such as adding normal or uniform noise or using random projection, do not. Second, the proposed method combines the favorable features of additive perturbation and orthogonal transform methods. It uses principal component analysis (PCA) to decorrelate the data and thus guards against attacks based on data correlations. It then adds Laplace noise to guard against attacks that can recover the PCA transform. Third, the proposed method improves the accuracy of a popular distance-based classification algorithm, K-nearest neighbor classification, by taking into account the degree of distance distortion introduced by sanitization. Extensive experiments demonstrate the effectiveness of the proposed method.
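The two-stage idea described above (decorrelate with PCA, then add Laplace noise) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes two-dimensional data so that the PCA rotation has a closed form, and the function names and the noise `scale` parameter are hypothetical. How the paper calibrates the noise to obtain its worst-case bound is not stated in the abstract.

```python
import math
import random


def pca_rotate_2d(points):
    """Center 2-D points and rotate them onto their principal axes.

    Assumption: data is two-dimensional, so the eigenvectors of the
    2x2 covariance matrix are given by a single rotation angle.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Sample covariance entries.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the first principal axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)

    # Project each centered point onto the principal axes.
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def sanitize(points, scale, seed=0):
    """Hypothetical sanitizer: PCA rotation followed by Laplace noise."""
    rng = random.Random(seed)
    rotated = pca_rotate_2d(points)
    return [
        (u + laplace_noise(scale, rng), v + laplace_noise(scale, rng))
        for u, v in rotated
    ]


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
    for original, released in zip(data, sanitize(data, scale=0.5)):
        print(original, "->", released)
```

The rotation alone preserves pairwise Euclidean distances, which is why orthogonal transforms suit distance-based mining; the Laplace step then distorts those distances by an amount that grows with `scale`, which is the distortion a downstream K-nearest neighbor classifier would need to account for.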
