Exploring Data Partitions for What-if Analysis

What-if analysis is a data-intensive exploration of how changes in a model's input parameters influence its outcomes. It is motivated by a user who wants to understand how sensitive the model is to a given parameter in order to reach a set of goals defined over those outcomes. To avoid exploring all possible combinations of parameter values, efficient what-if analysis calls for partitioning the parameter values into data ranges and for a unified representation of the outcomes obtained per range. Traditional techniques to capture data ranges, such as histograms, are limited to one outcome dimension. Yet, in practice, what-if analysis often involves conflicting goals that are defined over different dimensions of the outcome. Working on each of those goals independently cannot capture the inherent trade-off between them. In this paper, we propose techniques to recommend data ranges for what-if analysis that capture not only data regularities, but also the trade-off between conflicting goals. Specifically, we formulate a parametric data partitioning problem and propose a method to find an optimal solution for it. To scale to large datasets, we further provide a heuristic solution to this problem. Through theoretical and empirical analyses, we establish performance guarantees in terms of runtime and result quality.
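To make the setting concrete, the following minimal sketch partitions a parameter into ranges, represents each range by its mean outcomes, and keeps the ranges that are not dominated with respect to two conflicting goals. It is an illustration only and not the method proposed in the paper: the equi-depth split, the `cost` and `quality` outcome names, the dominance rule (lower cost and higher quality are better), and the synthetic data are all assumptions introduced here.

```python
from statistics import mean

def equi_depth_ranges(rows, k):
    """Split rows, sorted by parameter value, into k ranges of near-equal size."""
    rows = sorted(rows, key=lambda r: r[0])
    n = len(rows)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [rows[bounds[i]:bounds[i + 1]]
            for i in range(k) if bounds[i] < bounds[i + 1]]

def summarize(part):
    """Represent one range by its parameter interval and its mean outcomes."""
    return {
        "range": (part[0][0], part[-1][0]),
        "cost": mean(r[1] for r in part),
        "quality": mean(r[2] for r in part),
    }

def pareto_ranges(summaries):
    """Keep the ranges that no other range dominates.

    Assumed goals: minimize cost, maximize quality.
    """
    def dominates(a, b):
        no_worse = a["cost"] <= b["cost"] and a["quality"] >= b["quality"]
        better = a["cost"] < b["cost"] or a["quality"] > b["quality"]
        return no_worse and better
    return [s for s in summaries
            if not any(dominates(t, s) for t in summaries)]

# Hypothetical model outputs: (parameter value, cost, quality).
rows = [(p, 10 + 2 * p, 50 + 8 * p - 0.5 * p * p) for p in range(12)]
summaries = [summarize(part) for part in equi_depth_ranges(rows, 4)]
for s in pareto_ranges(summaries):
    print(s["range"], s["cost"], round(s["quality"], 2))
```

On this synthetic data, the highest parameter range is dropped because a lower range achieves both lower mean cost and higher mean quality; the remaining ranges each represent a different point on the cost/quality trade-off. Treating each goal with its own one-dimensional histogram would not expose this relationship.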
