Paper Information - Exploring Data Partitions for What-if Analysis

Exploring Data Partitions for What-if Analysis

What-if analysis is a data-intensive exploration to inspect how changes in a set of input parameters of a model influence some outcomes. It is motivated by a user trying to understand the sensitivity of a model to a certain parameter in order to reach a set of goals that are defined over the outcomes. To avoid an exploration of all possible combinations of parameter values, efficient what-if analysis calls for a partitioning of parameter values into data ranges and a unified representation of the obtained outcomes per range. Traditional techniques to capture data ranges, such as histograms, are limited to one outcome dimension. Yet, in practice, what-if analysis often involves conflicting goals that are defined over different dimensions of the outcome. Working on each of those goals independently cannot capture the inherent trade-off between them. In this paper, we propose techniques to recommend data ranges for what-if analysis, which capture not only data regularities, but also the trade-off between conflicting goals. Specifically, we formulate a parametric data partitioning problem and propose a method to find an optimal solution for it. Targeting scalability to large datasets, we further provide a heuristic solution to this problem. By theoretical and empirical analyses, we establish performance guarantees in terms of runtime and result quality.
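To make the idea of per-range outcome summaries and conflicting goals concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes plain equal-width partitioning of a single parameter, a mean as the unified representation per range, and goals where lower is better in every outcome dimension. All function names (`partition_equal_width`, `dominates`, `pareto_ranges`) are hypothetical and introduced only for illustration.

```python
from statistics import mean


def partition_equal_width(samples, num_ranges):
    """Group (parameter, outcomes) samples into equal-width parameter ranges.

    Assumption: equal-width ranges stand in for the paper's partitioning,
    and each range is summarized by the mean of every outcome dimension.
    Returns a list of ((range_start, range_end), summary) for non-empty ranges.
    """
    values = [p for p, _ in samples]
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all parameter values are identical.
    width = (hi - lo) / num_ranges or 1.0
    buckets = [[] for _ in range(num_ranges)]
    for p, outcomes in samples:
        # Clamp so the maximum value falls into the last range.
        idx = min(int((p - lo) / width), num_ranges - 1)
        buckets[idx].append(outcomes)
    ranges = []
    for i, bucket in enumerate(buckets):
        if bucket:
            summary = tuple(mean(dim) for dim in zip(*bucket))
            ranges.append(((lo + i * width, lo + (i + 1) * width), summary))
    return ranges


def dominates(a, b):
    """True if a is no worse than b in every dimension and strictly better in one.

    Assumption: lower values are better in all outcome dimensions.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_ranges(ranges):
    """Keep only the ranges whose summary is not dominated by another range."""
    return [r for r in ranges if not any(dominates(o[1], r[1]) for o in ranges)]
```

Optimizing each outcome dimension on its own would return only the single best range per dimension. Keeping every non-dominated range instead exposes the trade-off: a range that is mediocre in both dimensions can still be worth recommending if no other range beats it in both.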
