OLAP over Probabilistic Data Cubes II: Parallel Materialization and Extended Aggregates

On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain), and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube consists of a set of probabilistic cuboids that summarize the aggregated values in the form of probability mass functions (pmfs for short), and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of the data poses computational challenges, since a probabilistic database can have an exponential number of possible worlds under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM, become holistic in probabilistic settings. We propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, through cube materialization, to query evaluation. We study two types of aggregation, convolution-based and sketch-based, which run in polynomial time and jointly enable efficient query processing. Also, our proposal is versatile in terms of: 1) its capability of supporting common aggregation functions, i.e., SUM, COUNT, MAX, and AVG; 2) its adaptivity to different materialization strategies, e.g., full versus partial materialization, with the support of our devised cost models and parallelization framework; 3) its coverage of common OLAP operations, i.e., probabilistic slicing and dicing queries. Extensive experiments over real and synthetic datasets show that our techniques are effective and scalable.
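
To make the contrast between possible-world enumeration and convolution-based aggregation concrete, the following is a minimal sketch in Python and is not taken from the paper. It assumes that the uncertain measures are independent and take small non-negative integer values, so that a pmf can be stored as a list indexed by value. The function names (convolve, sum_pmf, sum_pmf_possible_worlds) and the sample data are hypothetical.

```python
from itertools import product


def convolve(p, q):
    """Return the pmf of X + Y for independent X ~ p and Y ~ q.

    A pmf is a list in which index v holds P(value == v).
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def sum_pmf(pmfs):
    """pmf of SUM over independent measures, by repeated convolution."""
    result = [1.0]  # pmf of the constant 0
    for p in pmfs:
        result = convolve(result, p)
    return result


def sum_pmf_possible_worlds(pmfs):
    """Same pmf, obtained by enumerating every possible world.

    The number of worlds is the product of the pmf lengths, so this
    grows exponentially with the number of measures.
    """
    out = [0.0] * (sum(len(p) - 1 for p in pmfs) + 1)
    for world in product(*(range(len(p)) for p in pmfs)):
        prob = 1.0
        for p, v in zip(pmfs, world):
            prob *= p[v]
        out[sum(world)] += prob
    return out


# Three hypothetical uncertain measures (illustrative values only).
cells = [[0.5, 0.5], [0.2, 0.3, 0.5], [0.0, 1.0]]
by_convolution = sum_pmf(cells)
by_enumeration = sum_pmf_possible_worlds(cells)
```

Under these assumptions both functions return the same distribution, but the convolution version does work polynomial in the number of measures and the size of the value domain, whereas the enumeration visits every possible world.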
