Rapid Sampling for Visualizations with Ordering Guarantees

Visualizations are widely used to understand trends and gather insights from datasets, but they often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual properties of interest to analysts. Our primary focus is on sampling algorithms that preserve the visual property of ordering; our techniques also apply to some other visual properties. For instance, our algorithms can rapidly generate an approximate bar chart in which the comparison between any two bars is correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they take no more samples than necessary to generate visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.
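
To make the idea of sampling with an ordering guarantee concrete, the following is a minimal illustrative sketch and not the paper's algorithm as published. It assumes each bar is the mean of a group of values bounded in [0, value_range], draws samples with replacement, and uses a Hoeffding confidence interval with a union bound over groups and rounds. A group stops being sampled once its interval overlaps no other group's interval. The function name `sample_until_ordered` and all parameters are hypothetical.

```python
import math
import random


def sample_until_ordered(groups, delta=0.05, value_range=1.0, seed=0,
                         max_rounds=100_000):
    """Sample each group until its confidence interval overlaps no other.

    groups: dict mapping a group name to a list of values assumed to lie
    in [0, value_range]. Returns (estimated means, samples taken per group).
    """
    rng = random.Random(seed)
    k = len(groups)
    total = {g: 0.0 for g in groups}
    count = {g: 0 for g in groups}
    half = {g: math.inf for g in groups}  # interval half-width per group
    active = set(groups)                  # groups still being sampled

    for m in range(1, max_rounds + 1):
        if not active:
            break
        # Hoeffding half-width with failure budget delta / (k * m * (m + 1)),
        # which sums to at most delta over all groups and rounds.
        eps = value_range * math.sqrt(
            math.log(2 * k * m * (m + 1) / delta) / (2 * m))
        for g in groups:  # iterate in dict order so runs are reproducible
            if g in active:
                total[g] += rng.choice(groups[g])  # one draw, with replacement
                count[g] += 1
                half[g] = eps
        mean = {g: total[g] / count[g] for g in groups}
        # Keep a group active only while its interval touches another one.
        active = {
            g for g in active
            if any(abs(mean[g] - mean[h]) <= half[g] + half[h]
                   for h in groups if h != g)
        }

    return {g: total[g] / count[g] for g in groups}, count


# Hypothetical data: three bars whose true means are 0.2, 0.5 and 0.9.
bars = {
    "a": [0.1, 0.2, 0.3],
    "b": [0.45, 0.5, 0.55],
    "c": [0.8, 0.9, 1.0],
}
estimates, counts = sample_until_ordered(bars)
print(sorted(estimates, key=estimates.get))
print(counts)
```

In this sketch, groups whose means are far from all others stop early, while groups with close means keep drawing samples, so the sampling effort concentrates on the comparisons that are hardest to resolve. Groups with exactly equal means would never separate; the `max_rounds` cap is an assumed safeguard for that case.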
