Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data

We present a new system, called Searchlight, that uniquely integrates constraint solving and data management techniques. It allows Constraint Programming (CP) machinery to run efficiently inside a DBMS without the need to extract, transform, and move the data. This marriage concurrently offers the rich expressiveness and efficiency of constraint-based search and optimization provided by modern CP solvers, and the ability of DBMSs to store and query data at scale, resulting in an enriched functionality that can effectively support both data- and search-intensive applications. As such, Searchlight is the first system to support generic search, exploration and mining over large multi-dimensional data collections, going beyond point algorithms designed for point search and mining tasks.

Searchlight makes the following scientific contributions:

• Constraint solvers as first-class citizens. Instead of treating solver logic as a black box, Searchlight provides native support, incorporating the necessary APIs for its specification and transparent execution as part of query plans, as well as novel algorithms for its optimized execution and parallelization.

• Speculative solving. Existing solvers assume that the entire data set is main-memory resident. Searchlight uses an innovative two-stage Solve-Validate approach that allows it to operate speculatively yet safely on main-memory synopses, quickly producing candidate search results that can later be efficiently validated on real data.

• Computation and I/O load balancing. As CP solver logic can be computationally expensive, executing it on large search and data spaces requires novel CPU-I/O balancing approaches when performing search distribution.

We built a prototype implementation of Searchlight on Google's Or-Tools, an open-source suite of operations research tools, and the array DBMS SciDB. Extensive experimental results show that Searchlight often performs orders of magnitude faster than the next best approach (SciDB-only or CP-solver-only) in terms of end response time and time to first result.
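
The two-stage Solve-Validate idea described above can be illustrated with a small, self-contained sketch. This is not Searchlight's implementation and uses neither Or-Tools nor SciDB. It assumes a one-dimensional array, a synopsis made of per-block (min, max) pairs, and a query that looks for fixed-width windows whose average is at least a threshold. All function names and the choice of synopsis are hypothetical and serve only to show why speculating on a synopsis can be safe: an upper bound derived from block maxima never rules out a true result, so validation on the real data only has to remove false positives.

```python
from typing import List, Sequence, Tuple

def build_synopsis(data: Sequence[float], block: int) -> List[Tuple[float, float]]:
    """Summarize each block of `block` cells by its (min, max). Hypothetical synopsis."""
    return [
        (min(data[i:i + block]), max(data[i:i + block]))
        for i in range(0, len(data), block)
    ]

def solve_on_synopsis(synopsis: Sequence[Tuple[float, float]], block: int,
                      n: int, width: int, threshold: float) -> List[int]:
    """Solve stage: keep every window whose optimistic average could reach the threshold.

    Each cell is bounded from above by the max of the block it falls in, so a window
    that fails this test cannot satisfy the constraint on the real data.
    """
    candidates = []
    for start in range(n - width + 1):
        upper_sum = sum(synopsis[i // block][1] for i in range(start, start + width))
        if upper_sum >= threshold * width:
            candidates.append(start)
    return candidates

def validate(data: Sequence[float], candidates: Sequence[int],
             width: int, threshold: float) -> List[int]:
    """Validate stage: re-check each candidate window against the real data."""
    return [s for s in candidates if sum(data[s:s + width]) >= threshold * width]

if __name__ == "__main__":
    data = [1, 2, 1, 9, 8, 9, 2, 1, 1, 2, 1, 1]
    block, width, threshold = 3, 3, 6.5
    synopsis = build_synopsis(data, block)
    candidates = solve_on_synopsis(synopsis, block, len(data), width, threshold)
    print(candidates)                                    # [2, 3, 4]
    print(validate(data, candidates, width, threshold))  # [3]
```

In this toy run the synopsis prunes seven of the ten possible windows without touching the data, and validation discards the two remaining false positives. A real system would need richer synopses and a general constraint solver in place of the exhaustive loop.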
