Privacy-Preserving Clustering by Object Similarity-Based Representation and Dimensionality Reduction Transformation

Preserving the privacy of individuals when data are shared for clustering is a challenging problem: data owners must meet privacy requirements while still guaranteeing valid clustering results. In this paper, we show that this dual goal can be achieved by transforming a database using two simple and effective data transformations: Object Similarity-Based Representation (OSBR) and Dimensionality Reduction-Based Transformation (DRBT). The former relies on the similarity between objects, and the latter on random projection. The major features of our data transformations are: a) they are independent of the particular distance-based clustering algorithm used; b) they have a sound mathematical foundation; and c) they do not require CPU-intensive operations.
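
The two ideas named above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that an OSBR-style release shares pairwise Euclidean distances in place of attribute values, and that a DRBT-style release shares a Gaussian random projection of the data. The function names, the toy data, and the choice of k = 3 are all hypothetical.

```python
import math
import random


def dissimilarity_matrix(rows):
    """Pairwise Euclidean distances between objects (assumed OSBR-style view)."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def random_projection(rows, k, seed=0):
    """Project d-dimensional rows onto k dimensions (assumed DRBT-style view)."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Entries drawn from N(0, 1); the 1/sqrt(k) factor keeps squared
    # distances unchanged in expectation.
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    scale = 1.0 / math.sqrt(k)
    return [
        [scale * sum(x[i] * r[i][j] for i in range(d)) for j in range(k)]
        for x in rows
    ]


# Hypothetical toy data: 4 objects with 6 numeric attributes each.
data = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
]

shared_distances = dissimilarity_matrix(data)      # distances only, no attribute values
shared_projection = random_projection(data, k=3)   # reduced table with 3 columns
```

Either output can be passed to a distance-based clustering algorithm. With so few dimensions the projected distances only roughly track the original ones; the approximation generally tightens as k grows.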
