A tool framework for tweaking features in synthetic datasets

Researchers and developers use benchmarks to compare their algorithms and products. A database benchmark needs a dataset D, and for the benchmark to be application-specific, D should be empirical. However, D may be too small or too large for the benchmarking experiments, so it must be scaled to the desired size. To ensure that the scaled dataset D' is similar to D, previous work typically specifies or extracts a fixed set of features F = {F_1, F_2, ..., F_n} from D, then uses F to generate synthetic data for D'. This approach (D -> F -> D') becomes increasingly intractable as F gets larger, so a new solution is necessary. This paper proposes ASPECT, which takes a different approach to scaling D while enforcing similarity. ASPECT first uses a size-scaler (S0) to scale D to D'. The user then selects a set of desired features F'_1, ..., F'_n. For each desired feature F'_k, there is a tweaking tool T_k that tweaks D' so that it has the required feature F'_k. ASPECT coordinates the tweaking of D' by T_1, ..., T_n, so that T_n(...(T_1(D'))...) has the required features F'_1, ..., F'_n. By shifting from D -> F -> D' to D -> D' -> F', data scaling becomes flexible: users can customise the scaled dataset with the features they are interested in. Extensive experiments on real datasets show that ASPECT can enforce similarity in the dataset effectively and efficiently.
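The D -> D' -> F' pipeline can be illustrated with a minimal Python sketch. It is not ASPECT's implementation: the toy dataset (a list of integers), the cyclic size-scaler, the two clamping tools, and all function names are hypothetical and chosen only to show how a composition T_n(...(T_1(D'))...) might be applied and checked.

```python
from typing import Callable, List, Sequence

Dataset = List[int]                    # toy stand-in for a real dataset
Tweak = Callable[[Dataset], Dataset]   # a tweaking tool T_k


def size_scale(d: Sequence[int], target_size: int) -> Dataset:
    """Hypothetical size-scaler S0: repeat the rows of D cyclically."""
    if not d:
        raise ValueError("cannot scale an empty dataset")
    return [d[i % len(d)] for i in range(target_size)]


def make_min_tool(minimum: int) -> Tweak:
    """Hypothetical tool enforcing the feature 'every value >= minimum'."""
    def tool(data: Dataset) -> Dataset:
        return [max(v, minimum) for v in data]
    return tool


def make_max_tool(maximum: int) -> Tweak:
    """Hypothetical tool enforcing the feature 'every value <= maximum'."""
    def tool(data: Dataset) -> Dataset:
        return [min(v, maximum) for v in data]
    return tool


def apply_tools(d_scaled: Dataset, tools: Sequence[Tweak]) -> Dataset:
    """Apply T_1, ..., T_n in order, i.e. compute T_n(...(T_1(D'))...)."""
    out = list(d_scaled)
    for tool in tools:
        out = tool(out)
    return out


d = [3, 9, 1]                          # empirical dataset D
d_scaled = size_scale(d, 7)            # D' with the desired size
tweaked = apply_tools(d_scaled, [make_min_tool(2), make_max_tool(8)])
print(tweaked)
```

In this toy case the two tools do not interfere with each other as long as the minimum does not exceed the maximum. In general a later tool may undo a feature enforced by an earlier one, which is why some form of coordination among the tools is needed.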
