Privacy-preserving data mining

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
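To make the idea of reconstructing a distribution from perturbed records concrete, the following is a minimal sketch, not the procedure proposed in the paper. It assumes that each value is perturbed by adding independent Gaussian noise whose standard deviation is publicly known, and that the original distribution is approximated by a histogram that is refined with an iterative Bayes update. The function names (`perturb`, `reconstruct`), the bin count, and the iteration count are illustrative assumptions.

```python
import math
import random


def perturb(values, noise_sd, rng):
    """Return each value plus independent Gaussian noise (assumed scheme)."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


def gaussian_pdf(x, sd):
    """Density of a zero-mean Gaussian with standard deviation sd at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def reconstruct(perturbed, noise_sd, lo, hi, bins=20, iterations=50):
    """Estimate the original distribution as a histogram over [lo, hi].

    Starts from a uniform guess and repeatedly reweights each bin by how
    well it explains the perturbed observations (an iterative Bayes update).
    Returns the bin midpoints and the estimated probability of each bin.
    """
    width = (hi - lo) / bins
    mids = [lo + (k + 0.5) * width for k in range(bins)]
    # likelihood[i][k]: noise density needed to move bin k's midpoint
    # to the i-th perturbed observation.
    likelihood = [[gaussian_pdf(w - m, noise_sd) for m in mids]
                  for w in perturbed]
    probs = [1.0 / bins] * bins
    for _ in range(iterations):
        updated = [0.0] * bins
        for row in likelihood:
            denom = sum(l * p for l, p in zip(row, probs))
            if denom == 0.0:
                continue  # observation too far from every bin to contribute
            for k in range(bins):
                updated[k] += row[k] * probs[k] / denom
        total = sum(updated)
        probs = [u / total for u in updated]
    return mids, probs


# Hypothetical data: two clusters of values, near 25 and near 65.
rng = random.Random(0)
original = ([rng.uniform(20, 30) for _ in range(500)]
            + [rng.uniform(60, 70) for _ in range(500)])
noisy = perturb(original, noise_sd=10.0, rng=rng)
mids, probs = reconstruct(noisy, noise_sd=10.0, lo=0.0, hi=100.0)
for m, p in zip(mids, probs):
    print(f"{m:5.1f}  {p:.3f}")
```

Only the perturbed values and the noise parameters enter `reconstruct`; the individual original values are never used. The estimated histogram could then stand in for the original attribute distribution when choosing split points for a decision tree.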
