Is this Real?: Generating Synthetic Data that Looks Real

Synner is a tool that helps users generate realistic synthetic data by visually and declaratively specifying the properties of a dataset, such as each field's statistical distribution, its domain, and its relationship to other fields. It gives instant feedback on every user interaction by updating multiple visualizations of the generated dataset, and it suggests data generation specifications from a few user examples and interactions. Synner also visually communicates the inherent randomness of statistical data generation. Our evaluation demonstrates Synner's effectiveness at generating realistic data when compared with Mockaroo, a popular data generation tool, and with hired developers who coded data generation scripts for a fee.
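
To make the idea of a declarative specification concrete, the following is a minimal sketch in Python. It is not Synner's interface or specification language; the field names (`age`, `sex`, `income`), the distribution parameters, and the `generate_rows` helper are all assumptions chosen for illustration. Each field is described by a distribution, a domain, and, where relevant, a dependence on an earlier field.

```python
import random


def generate_rows(n, seed=0):
    """Generate n synthetic rows from a small, hypothetical field spec."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced

    # Hypothetical spec: one generator per field, evaluated in order.
    # A generator receives the partially built row, so a field can
    # depend on fields defined before it.
    spec = {
        # Normal distribution, clamped to the domain [18, 90].
        "age": lambda row: min(90, max(18, round(rng.gauss(40, 12)))),
        # Uniform choice over a categorical domain.
        "sex": lambda row: rng.choice(["F", "M"]),
        # Relationship to another field: income grows with age, plus noise,
        # and is kept non-negative.
        "income": lambda row: round(max(0.0, 1000 * row["age"] + rng.gauss(0, 5000)), 2),
    }

    rows = []
    for _ in range(n):
        row = {}
        for field, generator in spec.items():
            row[field] = generator(row)
        rows.append(row)
    return rows


rows = generate_rows(5)
for row in rows:
    print(row)
```

Because the generators draw from a random source, a different seed yields a different but statistically similar dataset, which is the kind of inherent randomness the abstract refers to.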
