OLAP over Probabilistic Data Cubes II: on Parallelization and Aggregating Extension

On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain), and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube comprises a set of probabilistic cuboids that summarize the aggregated values in the form of probability mass functions (pmfs for short), and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of the data poses computational challenges, since a probabilistic database can have an exponential number of possible worlds under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM, become holistic in probabilistic settings. We propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, through cube materialization, to query evaluation. We study two types of aggregation, convolution-based and sketch-based, both of which run in polynomial time and which jointly enable efficient query processing. Our proposal is also versatile in terms of: 1) its support for common aggregation functions, i.e., SUM, COUNT, MAX, and AVG; 2) its adaptability to different materialization strategies, e.g., full versus partial materialization, supported by our devised cost models and parallelization framework; 3) its coverage of common OLAP operations, i.e., probabilistic slicing and dicing queries. Extensive experiments over real and synthetic datasets show that our techniques are effective and scalable.
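
To illustrate why convolution avoids the exponential blow-up of possible worlds, the following minimal sketch computes the pmf of a SUM over uncertain measures in two ways: by enumerating every possible world, and by convolving the per-tuple pmfs pairwise. It is not taken from the paper. It assumes the measures are independent and take non-negative integer values, so that each pmf can be stored as a list indexed by value. The function names `convolve`, `sum_pmf_by_convolution`, and `sum_pmf_by_possible_worlds` are hypothetical.

```python
from itertools import product


def convolve(p, q):
    """Pmf of X + Y for independent X ~ p and Y ~ q (lists indexed by value)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def sum_pmf_by_convolution(pmfs):
    """Fold the pmfs pairwise; cost is polynomial in the number of tuples and values."""
    result = [1.0]  # pmf of the constant 0, the identity element for convolution
    for pmf in pmfs:
        result = convolve(result, pmf)
    return result


def sum_pmf_by_possible_worlds(pmfs):
    """Reference version: enumerate every possible world (exponential in len(pmfs))."""
    size = sum(len(p) - 1 for p in pmfs) + 1
    result = [0.0] * size
    for world in product(*(range(len(p)) for p in pmfs)):
        prob = 1.0
        for value, pmf in zip(world, pmfs):
            prob *= pmf[value]
        result[sum(world)] += prob
    return result


# Three uncertain measures; entry k of each list is P(measure == k).
pmfs = [[0.5, 0.5], [0.2, 0.3, 0.5], [0.9, 0.1]]
conv = sum_pmf_by_convolution(pmfs)
brute = sum_pmf_by_possible_worlds(pmfs)
```

Both functions return the same distribution over the possible sums 0 to 4. The enumeration visits every combination of values, whereas the convolution does work proportional to the product of the pmf lengths at each step. Replacing the quadratic inner loop with an FFT-based convolution would be a natural refinement under the same assumptions.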
