RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual Query

To visualize and analyze a large-scale dataset, analysts commonly investigate data distributions derived from statistical aggregations and presented as charts, such as histograms and binned scatterplots. Aggregate queries are implicitly executed throughout this process. Because such datasets are often extremely large, response time is typically reduced by precomputing data cubes. However, queries are then limited to the predefined binning schema of the preprocessed data cubes. This limitation hinders analysts from flexibly adjusting visual specifications to investigate implicit patterns in the data effectively. To address this limitation, we propose RSATree, which enables arbitrary queries and flexible binning strategies by leveraging three schemes: an R-tree-based space partitioning scheme to capture the data distribution, a locality-sensitive hashing technique to achieve locality-preserving random access to data items, and a summed area table scheme to support interactive queries of aggregated values with linear computational complexity. We also present and implement a web-based visual query system that supports visual specification, query, and exploration of large-scale tabular data with user-adjustable granularities. We demonstrate the efficiency and utility of our approach through experiments on real-world datasets and an analysis of time and space complexity.
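As a rough illustration of the summed area table idea mentioned above, the following Python sketch builds a table over a small grid of pre-binned counts and then re-aggregates it under a coarser, user-chosen binning. It is only a minimal sketch under stated assumptions and not the RSATree implementation: the function names `build_sat`, `range_sum`, and `rebin` are hypothetical, the grid is a plain 2D array, and the R-tree partitioning and locality-sensitive hashing components are not modeled.

```python
from typing import List

Grid = List[List[int]]


def build_sat(counts: Grid) -> Grid:
    """Build a summed area table with one extra row and column of zeros.

    sat[r][c] holds the sum of counts[0:r][0:c], so the zero padding
    lets range queries avoid special cases at the grid border.
    """
    rows = len(counts)
    cols = len(counts[0]) if rows else 0
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (
                counts[r][c] + sat[r][c + 1] + sat[r + 1][c] - sat[r][c]
            )
    return sat


def range_sum(sat: Grid, r0: int, c0: int, r1: int, c1: int) -> int:
    """Sum of counts over rows [r0, r1) and columns [c0, c1).

    Uses four table lookups regardless of the size of the rectangle.
    """
    return sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0]


def rebin(sat: Grid, row_edges: List[int], col_edges: List[int]) -> Grid:
    """Aggregate the fine grid into coarser bins given by edge indices.

    row_edges and col_edges are increasing lists of fine-grid indices;
    consecutive pairs delimit one output bin (hypothetical convention).
    """
    return [
        [
            range_sum(sat, row_edges[i], col_edges[j],
                      row_edges[i + 1], col_edges[j + 1])
            for j in range(len(col_edges) - 1)
        ]
        for i in range(len(row_edges) - 1)
    ]


if __name__ == "__main__":
    # Illustrative 4x4 grid of counts (made-up values, not from the paper).
    fine = [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16],
    ]
    table = build_sat(fine)
    # Re-aggregate into a 2x2 binned view without rescanning the raw cells.
    print(rebin(table, [0, 2, 4], [0, 2, 4]))
```

The bin edges passed to `rebin` can be changed freely after the table is built, which is the property that lets a binned chart be redrawn at a different granularity without recomputing aggregates from the raw rows.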
