Interactive data exploration using semantic windows

We present a new interactive data exploration approach, called Semantic Windows (SW), in which users query for multidimensional "windows" of interest via standard DBMS-style queries enhanced with exploration constructs. Users can specify SWs using (i) shape-based properties, e.g., "identify all 3-by-3 windows", as well as (ii) content-based properties, e.g., "identify all windows in which the average brightness of stars exceeds 0.8". This SW approach enables the interactive processing of a host of useful exploratory queries that are difficult to express and optimize using standard DBMS techniques. SW uses a sampling-guided, data-driven search strategy to explore the underlying data set and quickly identify windows of interest. To facilitate human-in-the-loop interactive processing, SW is optimized to produce online results during query execution. To control the tension between online performance and query completion time, SW uses a tunable, adaptive prefetching technique. To enable exploration of big data, the framework supports distributed computation. We describe the semantics and implementation of SW as a distributed layer on top of PostgreSQL. Experimental results on real astronomical and artificial data show that SW can offer online results quickly and continuously with little or no degradation in query completion times.
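
To make the two kinds of window properties concrete, the following is a minimal Python sketch of what such a query asks for, not of how SW evaluates it. It assumes the data has already been binned into a 2-D grid of per-cell brightness values, and it enumerates every window exhaustively using a summed-area table, whereas SW itself is described as using a sampling-guided search. The function name `find_windows`, its parameters, and the sample grid are hypothetical and chosen only for illustration.

```python
from itertools import product


def find_windows(grid, height, width, threshold):
    """Return top-left (row, col) corners of all height-by-width windows
    whose mean cell value exceeds threshold.

    Hypothetical helper: exhaustive enumeration, not SW's search strategy.
    """
    rows, cols = len(grid), len(grid[0])

    # Summed-area table: prefix[r][c] holds the sum of grid[0:r][0:c].
    prefix = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r, c in product(range(rows), range(cols)):
        prefix[r + 1][c + 1] = (
            grid[r][c] + prefix[r][c + 1] + prefix[r + 1][c] - prefix[r][c]
        )

    hits = []
    # Shape-based property: only windows of exactly height-by-width cells.
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            total = (
                prefix[r + height][c + width]
                - prefix[r][c + width]
                - prefix[r + height][c]
                + prefix[r][c]
            )
            # Content-based property: average brightness above the threshold.
            if total / (height * width) > threshold:
                hits.append((r, c))
    return hits


# Assumed sample data: a bright 3-by-3 block in the top-left corner.
sample_grid = [
    [0.9, 0.9, 0.9, 0.1],
    [0.9, 0.9, 0.9, 0.1],
    [0.9, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
bright_windows = find_windows(sample_grid, height=3, width=3, threshold=0.8)
```

On a grid with many cells the number of candidate windows grows quickly, which is the kind of cost that motivates a guided search with online results rather than full enumeration.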
