Paper information - Spherical microaggregation: Anonymizing sparse vector spaces

Unstructured text is a very common data type, yet it remains largely unexplored in the privacy-preserving data mining field. We consider the problem of releasing public information about a set of confidential documents. To that end, we have developed a method to protect a Vector Space Model (VSM) so that it can be made public even if the documents it represents are private. The method is inspired by microaggregation, a popular protection method from statistical disclosure control, and is adapted to work with sparse, high-dimensional data sets.
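The abstract does not spell out the algorithm, so the following is only a minimal sketch of what microaggregation on a sphere could look like, under stated assumptions: document vectors are scaled to unit length, similarity is the cosine (dot product of unit vectors), records are grouped greedily into clusters of at least k, and each record is replaced by its cluster's unit-length centroid. The function names `normalize_rows` and `spherical_microaggregate`, the greedy seeding rule, and the toy matrix are all hypothetical and are not taken from the paper.

```python
import numpy as np


def normalize_rows(X):
    """Scale each row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms


def spherical_microaggregate(X, k):
    """Greedy fixed-size grouping by cosine similarity (illustrative only).

    Assumption: this is a simple heuristic, not the paper's procedure.
    Returns the protected matrix and the list of groups (row indices).
    """
    X = normalize_rows(np.asarray(X, dtype=float))
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of records")

    unassigned = list(range(n))
    groups = []
    # Build groups of exactly k while at least 2k records remain, so the
    # final leftover group always holds between k and 2k - 1 records.
    while len(unassigned) >= 2 * k:
        U = X[unassigned]
        # Seed: the record least similar to the mean direction of the rest.
        seed = int(np.argmin(U @ U.mean(axis=0)))
        sims = U @ U[seed]
        sims[seed] = np.inf  # make sure the seed joins its own group
        nearest = np.argsort(-sims)[:k]
        group = [unassigned[i] for i in nearest]
        groups.append(group)
        chosen = set(group)
        unassigned = [i for i in unassigned if i not in chosen]
    if unassigned:
        groups.append(unassigned)

    # Replace every record by the unit-length centroid of its group.
    protected = np.zeros_like(X)
    for group in groups:
        centroid = normalize_rows(X[group].mean(axis=0, keepdims=True))[0]
        protected[group] = centroid
    return protected, groups


# Toy term-weight matrix: 6 documents over 6 terms (made-up values).
docs = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [3, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 1, 1, 3, 0],
], dtype=float)

protected, groups = spherical_microaggregate(docs, k=3)
print(groups)
print(np.round(protected, 3))
```

Because every released row is shared by at least k documents, no single document's vector is published on its own. A real VSM would normally be stored as a sparse matrix; the dense array here only keeps the sketch short.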
