Spatial Online Sampling and Aggregation

The widespread adoption of smartphones and other mobile devices has generated enormous amounts of spatial and spatio-temporal data, and the importance of spatial analytics and aggregation keeps growing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation that use all data points satisfying a query condition are expensive, especially over large data sets, and cannot meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. As more samples arrive, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators whose accuracy improves over time. Our design works well for both memory-based and disk-resident data sets, and scales well across different query and sample sizes. More importantly, our structures are dynamic, so they can handle insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures over baseline methods.
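To make the idea of online sampling and aggregation concrete, here is a minimal Python sketch of a naive baseline. It is not the indexing structure proposed in the paper. It scans all points, keeps those inside a query rectangle, and emits them in random order while maintaining a running mean and an approximate 95% confidence interval. The names `online_samples` and `online_mean`, the `(x, y, value)` point layout, and the normal-approximation interval are assumptions made for illustration only.

```python
import math
import random
from typing import Iterator, List, Tuple

Point = Tuple[float, float, float]        # (x, y, value); layout is hypothetical
Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi)


def online_samples(points: List[Point], rect: Rect,
                   rng: random.Random) -> Iterator[Point]:
    """Yield the points inside rect one at a time, in random order.

    Naive baseline: a full scan followed by a shuffle, i.e. sampling
    without replacement from the query result.
    """
    x_lo, y_lo, x_hi, y_hi = rect
    matching = [p for p in points
                if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi]
    rng.shuffle(matching)
    yield from matching


def online_mean(points: List[Point], rect: Rect, rng: random.Random,
                z: float = 1.96) -> Iterator[Tuple[int, float, float]]:
    """Yield (samples_seen, running_mean, ci_half_width) after each sample.

    Uses Welford's update for the variance. The interval is a normal
    approximation without a finite-population correction, so it is only
    a rough, conservative guide.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for _, _, value in online_samples(points, rect, rng):
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        if n > 1:
            half = z * math.sqrt(m2 / (n - 1) / n)
        else:
            half = math.inf  # no variance estimate from a single sample
        yield n, mean, half


if __name__ == "__main__":
    data_rng = random.Random(0)
    pts = [(data_rng.uniform(0, 10), data_rng.uniform(0, 10),
            data_rng.uniform(0, 100)) for _ in range(2000)]
    for n, mean, half in online_mean(pts, (2.0, 2.0, 8.0, 8.0),
                                     random.Random(1)):
        if n in (10, 100, 500):
            print(f"n={n}: mean ~ {mean:.2f} +/- {half:.2f}")
```

A caller can stop consuming the generator as soon as the interval is narrow enough, which is the interactive behaviour described above. The full scan in `online_samples` is exactly the cost that an index supporting online sampling is meant to avoid.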
