Meet Charles, big data query advisor

In scientific data management and business analytics, finding the most informative queries is a holy grail. Data collection keeps getting simpler, yet data exploration gets significantly harder. An exploratory query is likely to return either an empty or an overwhelming result set. Data mining algorithms, on the other hand, require extensive preparation and ample time, and they do not scale well. In this paper, we address this challenge at its core: how to query the query space associated with a given database. The space considered is formed by conjunctive predicates. To express them, we introduce the Segmentation Description Language (SDL). Given a user query, Charles, our query advisory system, breaks the query's extent into meaningful segments and returns the corresponding SDL descriptions. These descriptions provide insight into the set described and offer the user directions for further exploration. We introduce a novel algorithm to generate SDL answers and evaluate the answers using four orthogonal criteria: homogeneity, simplicity, breadth, and entropy. A prototype implementation has been constructed, and the landscape of follow-up research is sketched.
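To make the idea of segmenting a query's extent concrete, here is a minimal Python sketch. It is not Charles's algorithm and does not use SDL syntax, neither of which is given in this abstract. It assumes a conjunctive predicate can be modelled as a dictionary of half-open ranges, uses a plain median cut on one attribute as a stand-in segmentation step, and reads the entropy criterion as the Shannon entropy of the segment sizes. The table, column names, and function names are all hypothetical.

```python
import math
from statistics import median

# Hypothetical toy table; column names and values are invented for illustration.
ROWS = [{"magnitude": float(m), "depth": 5 * m} for m in range(1, 9)]

# Assumed representation of a conjunctive predicate:
# {attribute: (low, high)} meaning low <= value < high for every attribute.
QUERY = {"magnitude": (1.0, 9.0), "depth": (0, 100)}


def extent(rows, predicate):
    """Return the rows that satisfy every range in the conjunctive predicate."""
    return [
        row
        for row in rows
        if all(low <= row[attr] < high for attr, (low, high) in predicate.items())
    ]


def split(rows, predicate, attr):
    """Cut the extent at the median of `attr` (assumed stand-in for segmentation).

    Returns two narrower conjunctive predicates that together cover the
    original range of `attr`. Requires a non-empty extent.
    """
    low, high = predicate[attr]
    cut = median(row[attr] for row in extent(rows, predicate))
    return [
        {**predicate, attr: (low, cut)},
        {**predicate, attr: (cut, high)},
    ]


def entropy(sizes):
    """Shannon entropy (in bits) of the distribution of segment sizes."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s)


segments = split(ROWS, QUERY, "magnitude")
sizes = [len(extent(ROWS, segment)) for segment in segments]
for segment, size in zip(segments, sizes):
    print(segment, size)
print("entropy of segment sizes:", entropy(sizes))
```

Under this reading, each segment is itself a conjunctive predicate, so it can be fed back in as the next query, and a higher entropy indicates segments of more balanced size.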
