VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository

Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs. These exemplars often lack the characteristics of real-world datasets, and their one-off nature makes it difficult to compare different techniques. In this paper, we present VizNet: a large-scale corpus of over 31 million datasets compiled from open data repositories and online visualization galleries. On average, these datasets comprise 17 records over 3 dimensions. Across the corpus, 51% of the dimensions record categorical data, 44% quantitative, and only 5% temporal. VizNet provides the necessary common baseline for comparing visualization design techniques and for developing benchmark models and algorithms for automating visual analysis. To demonstrate VizNet's utility as a platform for conducting online crowdsourced experiments at scale, we replicate a prior study assessing the influence of user task and data distribution on visual encoding effectiveness, and extend it by considering an additional task: outlier detection. To contend with running such studies at scale, we demonstrate how a metric of perceptual effectiveness can be learned from experimental results, and show its predictive power across test datasets.
