Privacy Preserving Database Generation for Database Application Testing

Testing database applications is of great importance. Although various studies have investigated testing techniques for database design, relatively few have explicitly addressed the testing of database applications, which requires a large amount of representative data. Because testing over live production databases is often infeasible, owing to the high risk of disclosing confidential information or incorrectly updating real data, this paper investigates the problem of generating synthetic databases based on a priori knowledge about production databases. Our approach is to fit the general location model using various characteristics (e.g., constraints, statistics, rules) extracted from a production database and then generate synthetic data using the learned model. The generated data is valid and similar to the real data in terms of statistical distribution; hence it can be used for functional and performance testing. Because the extracted characteristics may contain information that attackers could use to derive confidential information about individuals, we present a disclosure analysis method that applies cell suppression for identity disclosure analysis and perturbation for value disclosure analysis.
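To make the generation step concrete, the following is a minimal sketch of sampling from a general location model, not the authors' implementation. In its standard form, the model treats each combination of categorical values as a cell of a contingency table with a multinomial probability, and treats the continuous attributes within a cell as multivariate normal with a cell-specific mean and a covariance matrix shared across cells. The function names (`cholesky`, `generate_rows`), the attribute layout, and all statistics below are hypothetical values chosen for illustration; the sketch does not enforce constraints or rules and performs no disclosure analysis.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L with L * L^T == cov (cov symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def generate_rows(cell_counts, cell_means, shared_cov, n_rows, seed=0):
    """Sample synthetic rows from a general location model.

    cell_counts: dict mapping a categorical cell (tuple) to its observed count
    cell_means:  dict mapping the same cells to a mean vector for the continuous attributes
    shared_cov:  covariance matrix of the continuous attributes, common to all cells
    """
    rng = random.Random(seed)
    cells = list(cell_counts)
    weights = [cell_counts[c] for c in cells]
    L = cholesky(shared_cov)
    dim = len(shared_cov)
    rows = []
    for _ in range(n_rows):
        # Step 1: draw the categorical cell with probability proportional to its count.
        cell = rng.choices(cells, weights=weights, k=1)[0]
        # Step 2: draw the continuous attributes as mean + L * z, with z standard normal.
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x = [cell_means[cell][i] + sum(L[i][k] * z[k] for k in range(i + 1))
             for i in range(dim)]
        rows.append((cell, x))
    return rows


# Hypothetical statistics "extracted" from a production table with two
# categorical attributes (gender, state) and two continuous ones (income, balance).
cell_counts = {("F", "NY"): 40, ("M", "NY"): 30, ("F", "CA"): 20, ("M", "CA"): 10}
cell_means = {
    ("F", "NY"): [60.0, 5.0],
    ("M", "NY"): [55.0, 4.0],
    ("F", "CA"): [70.0, 6.0],
    ("M", "CA"): [65.0, 3.0],
}
shared_cov = [[4.0, 2.0], [2.0, 3.0]]

rows = generate_rows(cell_counts, cell_means, shared_cov, n_rows=5000, seed=42)
```

Only aggregate statistics enter the generator, never individual production records, which is why the extracted statistics themselves become the object of the disclosure analysis described above.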
