AIDE: An Active Learning-Based Approach for Interactive Data Exploration

In this paper, we argue that database systems should be augmented with an automated data exploration service that methodically steers users through the data in a meaningful way. Such a service is crucial for deriving insights from the complex datasets found in many big data applications, such as scientific and healthcare applications, and for reducing the human effort of data exploration. Towards this end, we present AIDE, an Automatic Interactive Data Exploration framework that assists users in discovering new, interesting data patterns and eliminates expensive ad-hoc exploratory queries. AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to learn the user's interests accurately, based on the user's relevance feedback on strategically collected samples. We present a number of exploration techniques, as well as optimizations that minimize the number of samples presented to the user while offering interactive performance. AIDE can deliver highly accurate query predictions for very common conjunctive queries with small user effort and, given a reasonable number of samples, can also predict complex disjunctive queries with high accuracy. It provides interactive performance, limiting the user wait time per iteration of exploration to less than a few seconds.
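
The feedback loop described above (show samples, collect relevance labels, refit a model of the user's interest, choose the next samples) can be illustrated with a minimal sketch. This is not AIDE's implementation. The bounding-box learner, the coarse-grid discovery step, the `margin` parameter, and all function names are assumptions made for illustration, and the user is simulated by an oracle that tests a hidden conjunctive range predicate over two attributes.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # inclusive bounds: (x_lo, x_hi, y_lo, y_hi)


def in_box(p: Point, box: Box) -> bool:
    """True if point p satisfies the conjunctive range predicate given by box."""
    return box[0] <= p[0] <= box[1] and box[2] <= p[1] <= box[3]


def fit_box(positives: List[Point]) -> Optional[Box]:
    """Stand-in learner (an assumption): tightest box around the relevant samples."""
    if not positives:
        return None
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def explore(
    data: List[Point],
    oracle: Callable[[Point], bool],
    iterations: int = 15,
    per_iter: int = 10,
    margin: int = 5,
    seed: int = 0,
) -> Tuple[Optional[Box], Dict[Point, bool]]:
    """Iteratively label small batches and refit the predicted query."""
    rng = random.Random(seed)
    labeled: Dict[Point, bool] = {}
    box: Optional[Box] = None
    for _ in range(iterations):
        unlabeled = [p for p in data if p not in labeled]
        if box is None:
            # Discovery: probe the centers of a coarse 20x20 grid.
            candidates = [p for p in unlabeled if p[0] % 20 == 10 and p[1] % 20 == 10]
        else:
            # Refinement: sample just outside the current predicted boundary.
            grown = (box[0] - margin, box[1] + margin, box[2] - margin, box[3] + margin)
            candidates = [p for p in unlabeled if in_box(p, grown) and not in_box(p, box)]
        if not candidates:
            candidates = unlabeled
        batch = rng.sample(candidates, min(per_iter, len(candidates)))
        for p in batch:
            labeled[p] = oracle(p)  # simulated relevance feedback
        box = fit_box([p for p, relevant in labeled.items() if relevant])
    return box, labeled


def score(data: List[Point], predicted: Box, target: Box) -> Tuple[float, float]:
    """Precision and recall of the predicted query against the hidden target."""
    tp = sum(1 for p in data if in_box(p, predicted) and in_box(p, target))
    fp = sum(1 for p in data if in_box(p, predicted) and not in_box(p, target))
    fn = sum(1 for p in data if not in_box(p, predicted) and in_box(p, target))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    data = [(x, y) for x in range(100) for y in range(100)]
    target = (20, 60, 30, 70)  # hidden user interest (hypothetical)
    box, labeled = explore(data, lambda p: in_box(p, target))
    print(box, len(labeled), score(data, box, target))
```

In this toy setting the predicted box can only grow toward the hidden region, so it never includes irrelevant tuples, and the number of labeled samples is bounded by `iterations * per_iter`. A classifier that can represent several disjoint regions would be needed for the disjunctive queries mentioned in the abstract.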
