Privacy: a machine learning view

The problem of disseminating a data set for machine learning while controlling the disclosure of data-source identity is formalized using a commuting diagram of functions. This formalization is used to state and analyze an optimization problem that balances privacy requirements against data utility. The analysis motivates a generalization mechanism for maintaining privacy while preserving the data's usefulness for machine learning. We present new proofs that minimizing information loss subject to a set of privacy requirements is NP-hard, both with and without an additional uniform coding requirement. As an initial analysis of the approximation properties of the problem, we show that the cell suppression problem with a constant number of attributes can be approximated within a constant factor. As a by-product, we obtain NP-hardness proofs for the minimum k-union and maximum k-intersection problems and for parallel versions of these; bounded versions of these problems are also shown to be approximable within a constant factor.
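The generalization and cell-suppression mechanisms referred to above are commonly instantiated as k-anonymity: every released record must agree on its quasi-identifying attributes with at least k-1 other records. The sketch below is a minimal illustration of suppression-based anonymization, with the number of suppressed cells serving as a crude information-loss measure. It is not the paper's algorithm (the paper shows that minimizing this loss optimally is NP-hard), and the function and field names (k_anonymize_by_suppression, zip, age, diagnosis) are hypothetical.

```python
from collections import Counter

def k_anonymize_by_suppression(records, quasi_identifiers, k):
    """Suppress quasi-identifier cells (replace with '*') in records whose
    quasi-identifier combination occurs fewer than k times; return the
    anonymized table and the number of suppressed cells (information loss)."""
    # Count how often each quasi-identifier combination appears.
    key = lambda r: tuple(r[a] for a in quasi_identifiers)
    counts = Counter(key(r) for r in records)

    anonymized, suppressed_cells = [], 0
    for r in records:
        if counts[key(r)] >= k:
            anonymized.append(dict(r))
        else:
            # Rare combination: suppress every quasi-identifier cell.
            masked = dict(r)
            for a in quasi_identifiers:
                masked[a] = "*"
                suppressed_cells += 1
            anonymized.append(masked)
    return anonymized, suppressed_cells

# Toy usage: 'zip' and 'age' are quasi-identifiers; 'diagnosis' is the payload.
records = [
    {"zip": "53711", "age": 34, "diagnosis": "flu"},
    {"zip": "53711", "age": 34, "diagnosis": "cold"},
    {"zip": "53715", "age": 61, "diagnosis": "asthma"},
]
table, loss = k_anonymize_by_suppression(records, ["zip", "age"], k=2)
print(loss)  # 2: both quasi-identifier cells of the unique (53715, 61) record were masked
```

A practical anonymizer would typically generalize values first (e.g. coarsening zip codes or binning ages) and fall back on suppression only when needed, since wholesale suppression destroys more of the utility that downstream learning depends on.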
