Identifying User Interests within the Data Space - a Case Study with SkyServer

Many scientific databases are now publicly available for querying and advanced data analytics. One prominent example is the Sloan Digital Sky Survey (SDSS) SkyServer, which offers data to astronomers, scientists, and the general public. For such data it is important to understand the public focus and the trending research directions in the subject the database describes, i.e., astronomy in the case of SkyServer. With a large user base, it is worthwhile to identify the areas of the data space that are of interest to users. In this paper, we study the problem of extracting and analyzing the access areas of user queries by analyzing the query logs of the database. To our knowledge, neither the concept of access areas nor how to extract them has been studied before. We address this by first proposing a novel notion of access area that is independent of any specific database state. It allows the detection of interesting areas within the data space, regardless of whether they already exist in the database content. Second, we present a detailed mapping of our notion to different query types. Applying this mapping to the SkyServer query log, we obtain a transformed data set. Third, we aggregate similar, overlapping queries with DBSCAN and thereby gain an abstraction from the raw query log. Finally, we arrive at clusters of access areas that are interesting from the perspective of an astronomer. These clusters occupy only a small fraction (in some cases less than 1%) of the data space and contain queries issued by many users. Some frequently accessed areas do not even exist in the space spanned by the available objects.
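The aggregation step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each access area is an axis-aligned box of attribute ranges (as a range predicate on `ra` and `dec` would produce) and that the distance between two areas is one minus their Jaccard overlap. The abstract does not state the distance function or the DBSCAN parameters used, so `overlap_distance`, `eps`, `min_pts`, and the sample boxes are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: attribute name -> (low, high) range.
Box = Dict[str, Tuple[float, float]]


def volume(box: Box) -> float:
    """Product of the side lengths of an axis-aligned box."""
    v = 1.0
    for lo, hi in box.values():
        v *= max(0.0, hi - lo)
    return v


def overlap_distance(a: Box, b: Box) -> float:
    """Assumed distance: 1 - (intersection volume / union volume).

    Boxes over different attribute sets, or with no overlap, get distance 1.
    """
    if a.keys() != b.keys():
        return 1.0
    inter = 1.0
    for attr in a:
        lo = max(a[attr][0], b[attr][0])
        hi = min(a[attr][1], b[attr][1])
        if hi <= lo:
            return 1.0
        inter *= hi - lo
    union = volume(a) + volume(b) - inter
    return 1.0 - inter / union if union > 0 else 1.0


def dbscan(items: List[Box], eps: float, min_pts: int,
           dist: Callable[[Box, Box], float]) -> List[int]:
    """Plain O(n^2) DBSCAN; returns one cluster label per item, -1 for noise."""
    NOISE = -1
    labels: List = [None] * len(items)

    def neighbors(i: int) -> List[int]:
        return [j for j in range(len(items)) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: expand the cluster
                queue.extend(nbrs)
    return labels


# Made-up access areas: three overlapping sky regions and one isolated query.
areas = [
    {"ra": (180.0, 190.0), "dec": (0.0, 10.0)},
    {"ra": (181.0, 191.0), "dec": (0.0, 10.0)},
    {"ra": (180.0, 190.0), "dec": (1.0, 11.0)},
    {"ra": (10.0, 20.0), "dec": (-5.0, 5.0)},
]
labels = dbscan(areas, eps=0.35, min_pts=3, dist=overlap_distance)
print(labels)  # [0, 0, 0, -1]
```

In this toy run the three overlapping boxes form one cluster and the isolated box is labeled as noise. Because the boxes are defined by query predicates rather than by stored rows, a cluster can form over a region that contains no objects at all.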
