AIDE: An Automated Sample-based Approach for Interactive Data Exploration

In this paper, we argue that database systems should be augmented with an automated data exploration service that methodically steers users through the data in a meaningful way. Such an automated system is crucial for deriving insights from the complex datasets found in many big data applications, such as scientific and healthcare applications, as well as for reducing the human effort of data exploration. Towards this end, we present AIDE, an Automatic Interactive Data Exploration framework that assists users in discovering new interesting data patterns and eliminates expensive ad-hoc exploratory queries. AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user's interests based on their relevance feedback on strategically collected samples. We present a number of exploration techniques as well as optimizations that minimize the number of samples presented to the user while offering interactive performance. AIDE can deliver highly accurate query predictions for very common conjunctive queries with small user effort, and, given a reasonable number of samples, it can predict complex disjunctive queries with high accuracy. It provides interactive performance, as it limits the user wait time per iteration of exploration to less than a few seconds.
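The feedback loop described above can be illustrated with a minimal sketch. It is not AIDE's implementation: the abstract does not say which classifier or sampling strategy is used, so this sketch assumes a simple bounding-box learner in place of the classification algorithm and uniform random sampling in place of strategic sample collection. All names (`fit_box`, `to_sql`, `explore`, the `oracle` callback that simulates the user) are hypothetical.

```python
import random


def fit_box(labeled):
    """Return the smallest axis-aligned box covering all relevant samples.

    `labeled` is a list of (point, is_relevant) pairs. This is a stand-in
    learner (an assumption), not the classifier used by AIDE.
    """
    positives = [point for point, relevant in labeled if relevant]
    if not positives:
        return None
    dims = range(len(positives[0]))
    return [(min(p[d] for p in positives), max(p[d] for p in positives))
            for d in dims]


def to_sql(box, columns, table):
    """Render the learned box as a conjunctive range query."""
    preds = [f"{col} BETWEEN {lo:.1f} AND {hi:.1f}"
             for col, (lo, hi) in zip(columns, box)]
    return f"SELECT * FROM {table} WHERE " + " AND ".join(preds)


def explore(data, oracle, rounds=10, batch=20, seed=0):
    """Run the sample / label / relearn loop.

    Each round draws `batch` tuples uniformly at random (an assumption;
    the paper collects samples strategically), asks the simulated user
    `oracle` for relevance feedback, and refits the model.
    Returns the final box and the number of samples shown to the user.
    """
    rng = random.Random(seed)
    labeled = []
    box = None
    for _ in range(rounds):
        for point in rng.sample(data, batch):
            labeled.append((point, oracle(point)))
        box = fit_box(labeled)
    return box, len(labeled)


if __name__ == "__main__":
    # Hypothetical two-attribute table and a hidden user interest.
    data = [(x, y) for x in range(100) for y in range(100)]

    def interested(p):
        return 20 <= p[0] <= 40 and 50 <= p[1] <= 80

    box, shown = explore(data, interested)
    if box is not None:
        print(f"after {shown} samples:", to_sql(box, ["a", "b"], "t"))
```

With this learner the predicted range can only grow toward the true one as more relevant samples are labeled, which makes the trade-off between user effort (samples shown) and prediction accuracy easy to observe. Disjunctive interests would need a richer model than a single box.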
