Dscaler: Synthetically Scaling A Given Relational Database

The Dataset Scaling Problem (DSP), defined in previous work, states: given an empirical set of relational tables D and a scale factor s, generate a database state D̃ that is similar to D but s times its size. A DSP solution is useful for application development (s < 1), scalability testing (s > 1), and anonymization (s = 1). Current solutions assume that all table sizes scale by the same ratio s. However, a real database tends to have tables that grow at different rates. This paper therefore considers non-uniform scaling (nuDSP), a generalization of DSP in which tables can scale by different factors instead of a single scale factor s. Dscaler is the first solution for nuDSP. It follows previous work in achieving similarity by reproducing correlation among the primary and foreign keys. However, it introduces the concept of a correlation database that captures fine-grained, per-tuple correlation. Experiments with well-known real and synthetic datasets D show that Dscaler produces D̃ with greater similarity to D than state-of-the-art techniques. Here, similarity is measured by the number of tuples, the frequency distribution of foreign key references, and multi-join aggregate queries.
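One of the similarity measures named above, the frequency distribution of foreign key references, can be illustrated with a small sketch. This is not Dscaler's algorithm or the paper's exact metric. The function names, the toy key lists, and the use of total variation distance to compare two distributions are all assumptions made for illustration.

```python
from collections import Counter


def fk_reference_distribution(parent_keys, child_fk_values):
    """Map k -> fraction of parent tuples referenced exactly k times.

    parent_keys: primary key values of the referenced (parent) table.
    child_fk_values: foreign key values found in the referencing table.
    """
    if not parent_keys:
        raise ValueError("parent table must not be empty")
    refs = Counter(child_fk_values)
    # Parents that are never referenced count as degree 0.
    degree_counts = Counter(refs.get(pk, 0) for pk in parent_keys)
    n = len(parent_keys)
    return {k: c / n for k, c in degree_counts.items()}


def distribution_distance(p, q):
    """Total variation distance between two distributions (assumed metric)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical original database D: 4 parent tuples, 4 child tuples.
original = fk_reference_distribution([1, 2, 3, 4], [1, 1, 2, 3])

# Hypothetical scaled database (both tables doubled) that keeps the pattern.
faithful = fk_reference_distribution(
    list(range(1, 9)), [1, 1, 2, 3, 5, 5, 6, 7]
)

# Hypothetical scaled database with the right sizes but a skewed pattern.
skewed = fk_reference_distribution(list(range(1, 9)), [1] * 8)

faithful_distance = distribution_distance(original, faithful)
skewed_distance = distribution_distance(original, skewed)
```

Both scaled examples have the correct number of tuples, yet only the first preserves how often parent tuples are referenced. This is why tuple counts alone are not a sufficient similarity measure. Under non-uniform scaling, where parent and child tables grow by different factors, the average number of references per parent necessarily changes, so the target distribution has to shift as well.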
