Towards publishing set-valued data with high utility

Set-valued data are common in databases and often contain sensitive information associated with data owners, so publishing them may lead to identity breaches. Pioneering techniques de-identify such data through k-anonymity, which may produce anonymized data of low utility. These techniques also assume that a predefined taxonomy tree exists. In this paper, we investigate the negative influence of the taxonomy tree on data utility and propose a novel method that anonymizes data in a utility-preserving manner: we artificially construct a pseudo taxonomy tree based on utility metrics and then anonymize the data against it. Experiments show that our construct-then-anonymize method is not only applicable to anonymizing set-valued data, but also considerably improves data utility.
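
To make the role of the taxonomy tree concrete, the following is a minimal sketch of generalization-based k-anonymity on set-valued data. It is not the method proposed in the paper: the taxonomy, the item names, and the helper functions `generalize` and `is_k_anonymous` are all hypothetical, and the check uses the simplest notion of k-anonymity (every published transaction must be identical to at least k - 1 others).

```python
from collections import Counter

# Hypothetical one-level taxonomy (child -> parent); item names are illustrative only.
TAXONOMY = {
    "beer": "alcohol",
    "wine": "alcohol",
    "milk": "dairy",
    "cheese": "dairy",
}


def generalize(transaction, taxonomy):
    """Replace each item by its parent in the taxonomy; unknown items are kept as-is."""
    return frozenset(taxonomy.get(item, item) for item in transaction)


def is_k_anonymous(transactions, k):
    """Return True if every distinct transaction occurs at least k times."""
    counts = Counter(transactions)
    return all(count >= k for count in counts.values())


# Four distinct raw transactions: each one is unique, so each is identifiable.
raw = [
    frozenset({"beer", "milk"}),
    frozenset({"wine", "milk"}),
    frozenset({"beer", "cheese"}),
    frozenset({"wine", "cheese"}),
]

# After generalization, all four transactions collapse to {"alcohol", "dairy"}.
generalized = [generalize(t, TAXONOMY) for t in raw]

print(is_k_anonymous(raw, 2))
print(is_k_anonymous(generalized, 2))
```

In this toy example, the generalized data satisfy 2-anonymity, but the distinction between individual items is lost. How much information is lost depends entirely on how the taxonomy groups items, which is why the choice of tree affects the utility of the published data.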
