Rapid sampling for visualizations with ordering guarantees

Visualizations are frequently used as a means to understand trends and gather insights from datasets, but often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual properties of interest to analysts. Our primary focus will be on sampling algorithms that preserve the visual property of ordering; our techniques will also apply to some other visual properties. For instance, our algorithms can be used to generate an approximate visualization of a bar chart very rapidly, where the comparisons between any two bars are correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they do not take more samples than necessary to generate the visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.
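To make the idea of an ordering guarantee concrete, the following is a minimal sketch of one way such a sampler could work. It is not taken from the paper: the function name `ordered_estimates`, the round-based loop, the use of Hoeffding's inequality with a union bound, and the assumption that all values lie in [0, 1] are illustrative choices. Each round draws one more sample (with replacement) from every group whose confidence interval still overlaps another group's interval, and a group stops being sampled once its interval is separated from all others.

```python
import math
import random


def ordered_estimates(groups, delta=0.05, seed=0, max_rounds=100_000):
    """Estimate per-group means so that, with probability at least
    1 - delta, sorting the estimates matches the order of the true means.

    Assumptions (illustrative, not from the paper): `groups` maps a group
    name to a non-empty list of values in [0, 1]; sampling is with
    replacement; intervals come from Hoeffding's inequality.
    """
    rng = random.Random(seed)
    names = list(groups)
    k = len(names)
    totals = {g: 0.0 for g in names}
    counts = {g: 0 for g in names}
    half_width = {g: 1.0 for g in names}
    active = list(names)  # groups whose interval still overlaps another
    means = {}

    for m in range(1, max_rounds + 1):
        # Half-width chosen so the failure probabilities, summed over all
        # rounds and all groups, stay below delta (union bound).
        eps = math.sqrt(
            math.log(math.pi ** 2 * k * m * m / (3 * delta)) / (2 * m)
        )
        for g in active:
            totals[g] += rng.choice(groups[g])
            counts[g] += 1
            half_width[g] = eps
        means = {g: totals[g] / counts[g] for g in names}

        def overlaps(g):
            # Inactive groups keep the interval they had when they stopped.
            return any(
                h != g
                and abs(means[g] - means[h]) <= half_width[g] + half_width[h]
                for h in names
            )

        active = [g for g in active if overlaps(g)]
        if not active:
            break  # every interval is separated; the ordering is settled

    # If max_rounds is reached first, the estimates are returned anyway
    # and the guarantee no longer applies.
    return means, counts


# Hypothetical data: three bars of a bar chart with well-separated means.
data = {
    "a": [0.1, 0.2] * 50,
    "b": [0.5, 0.6] * 50,
    "c": [0.9, 1.0] * 50,
}
estimates, samples_used = ordered_estimates(data)
print(sorted(estimates, key=estimates.get), samples_used)
```

In this sketch, groups whose means are far apart stop being sampled early, while groups that are close together keep receiving samples, which is what lets the total sample count stay small when the ordering is easy to establish.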
