Extracting Top-K Insights from Multi-dimensional Data

OLAP tools have been used extensively by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and to know precisely what they are looking for. This paper makes the first attempt at automatically extracting top-k insights from multi-dimensional data. Doing so not only helps non-expert users but also reduces the manual effort of data analysts. In particular, we propose the concept of an insight, which captures an interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of a measure by a dimension). An example insight is: "Brand B's rank (across brands) falls along the year, in terms of the increase in sales". Our problem is to compute the top-k insights according to a score function. This poses challenges in (i) the effectiveness of the result and (ii) the efficiency of computation. To address (i), we propose a meaningful scoring function for insights. To address (ii), we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing). Our experimental study on both real and synthetic data verifies the effectiveness and efficiency of the proposed solution.
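
To make the example insight concrete, the following is a minimal sketch of how such a multi-step derivation could be computed on a toy data set: aggregate sales by year and brand, take the year-over-year increase, rank brands within each year, and then keep the k candidates with the highest score. Everything here is an assumption for illustration: the records are invented, the function names are hypothetical, and the score (absolute rank change between the first and last year) is only a stand-in, not the scoring function proposed in the paper. No pruning, ordering, or cube sharing is attempted.

```python
from collections import defaultdict
import heapq

# Hypothetical toy records: (year, brand, sales). Invented for illustration.
RECORDS = [
    (2010, "A", 10), (2010, "B", 10), (2010, "C", 10),
    (2011, "A", 12), (2011, "B", 20), (2011, "C", 11),
    (2012, "A", 16), (2012, "B", 25), (2012, "C", 13),
    (2013, "A", 22), (2013, "B", 26), (2013, "C", 18),
]


def aggregate(records):
    """Step 0: SUM(sales) grouped by (year, brand)."""
    totals = defaultdict(float)
    for year, brand, sales in records:
        totals[(year, brand)] += sales
    return totals


def yearly_increase(totals):
    """Step 1: increase in sales relative to the previous year."""
    increase = {}
    for (year, brand), value in totals.items():
        previous = totals.get((year - 1, brand))
        if previous is not None:
            increase[(year, brand)] = value - previous
    return increase


def rank_across_brands(increase):
    """Step 2: rank brands within each year (1 = largest increase)."""
    by_year = defaultdict(list)
    for (year, brand), value in increase.items():
        by_year[year].append((value, brand))
    ranks = defaultdict(dict)
    for year, rows in by_year.items():
        ordered = sorted(rows, key=lambda row: (-row[0], row[1]))
        for position, (_, brand) in enumerate(ordered, start=1):
            ranks[brand][year] = position
    return ranks


def top_k_insights(records, k):
    """Score each brand's rank series and return the k highest-scoring ones.

    The score is a placeholder: |rank in last year - rank in first year|.
    """
    ranks = rank_across_brands(yearly_increase(aggregate(records)))
    candidates = []
    for brand, by_year in ranks.items():
        years = sorted(by_year)
        change = by_year[years[-1]] - by_year[years[0]]
        candidates.append((abs(change), brand, [by_year[y] for y in years]))
    return heapq.nlargest(k, candidates)


if __name__ == "__main__":
    for score, brand, series in top_k_insights(RECORDS, k=1):
        print(f"Brand {brand}: rank by sales increase over the years = {series}, score = {score}")
```

On this toy data, Brand B's rank by sales increase moves from 1 to 3, which mirrors the shape of the example insight. An exhaustive enumeration like this grows quickly with the number of dimensions and derivation steps, which is the efficiency challenge the abstract refers to.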
