Spherical microaggregation: Anonymizing sparse vector spaces

Unstructured text is a very common data type, yet it remains largely unexplored in privacy-preserving data mining. We consider the problem of releasing public information about a set of confidential documents. To that end, we have developed a method that protects a Vector Space Model (VSM) so that it can be published even when the documents it represents are private. The method is inspired by microaggregation, a popular protection method from statistical disclosure control, and is adapted to work with sparse, high-dimensional data sets.
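To make the idea concrete, the following is a minimal sketch of microaggregation applied to sparse document vectors. It is not the algorithm from the paper: the greedy grouping heuristic, the use of cosine similarity on unit-length vectors, and all function names (`normalize`, `cosine`, `spherical_centroid`, `microaggregate`) are assumptions chosen for illustration. Each document is a dict mapping terms to weights, documents are partitioned into groups of at least `k`, and every document is replaced by the normalized mean of its group.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors (their cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_centroid(vectors):
    """Sum the vectors term by term, then project the sum back to unit length."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalize(total)


def microaggregate(docs, k):
    """Illustrative greedy heuristic (an assumption, not the paper's method).

    Partitions the documents into groups of at least k members and
    replaces each document by the spherical centroid of its group.
    """
    if len(docs) < k:
        raise ValueError("need at least k documents")
    unit = [normalize(d) for d in docs]
    remaining = list(range(len(unit)))
    protected = [None] * len(unit)
    while remaining:
        if len(remaining) < 2 * k:
            # Too few left to form two groups: put them all in one group.
            group, remaining = remaining, []
        else:
            # Seed with the first unassigned document and add its
            # k - 1 most similar unassigned neighbours.
            seed, others = remaining[0], remaining[1:]
            others.sort(key=lambda i: -cosine(unit[seed], unit[i]))
            group = [seed] + others[: k - 1]
            remaining = others[k - 1:]
        centroid = spherical_centroid([unit[i] for i in group])
        for i in group:
            protected[i] = centroid
    return protected


# Hypothetical toy term-weight vectors for six documents.
docs = [
    {"privacy": 2, "data": 1},
    {"privacy": 1, "data": 1},
    {"privacy": 3, "mining": 1},
    {"football": 2, "goal": 1},
    {"football": 1, "goal": 2},
    {"goal": 1, "match": 3},
]
protected = microaggregate(docs, k=3)
```

In this sketch every released vector is shared by at least `k` documents, which is the indistinguishability property that microaggregation aims for. Projecting the centroid back onto the unit sphere keeps the protected vectors comparable under cosine similarity, although the centroid is typically denser than any single original vector.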
