OLAP over Imprecise Data with Domain Constraints

Several recent papers have focused on OLAP over imprecise data, where each fact can be a region, instead of a point, in a multi-dimensional space. They have provided a multiple-world semantics for such data, and developed efficient ways to answer OLAP aggregation queries over the imprecise facts. These solutions, however, assume that the imprecise facts can be interpreted independently of one another, a key assumption that is often violated in practice. Indeed, imprecise facts in real-world applications are often correlated, and such correlations can be captured as domain integrity constraints (e.g., repairs with the same customer names and models took place in the same city, or a text span can refer to a person or a city, but not both). In this paper we provide a framework for answering OLAP aggregation queries over imprecise data in the presence of such domain constraints. We first describe a relatively simple yet powerful constraint language, and formalize what it means to take into account such constraints in query answering. Next, we prove that OLAP queries can be answered efficiently given a database D* of fact marginals. We then exploit the regularities in the constraint space (captured in a constraint hypergraph) and the fact space to efficiently construct D*. We present extensive experiments over real-world and synthetic data to demonstrate the effectiveness of our approach.
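
As a rough illustration of the multiple-world semantics and of how a domain constraint prunes worlds, the following minimal Python sketch enumerates the possible worlds of three imprecise repair facts, keeps only the worlds in which repairs with the same customer name and model share a city, and derives per-fact marginals from the surviving worlds. The fact data, the function names (`satisfies`, `marginals`, `expected_count`), and the assumption that all valid worlds are equally likely are hypothetical and chosen for illustration. Brute-force enumeration is exponential in the number of facts and is not the construction of D* that the paper develops.

```python
from itertools import product
from collections import Counter

# Hypothetical imprecise facts: each repair's city is known only as a set of candidates.
facts = [
    {"id": 1, "customer": "Ann", "model": "F150", "cities": ["Madison", "Chicago"]},
    {"id": 2, "customer": "Ann", "model": "F150", "cities": ["Madison"]},
    {"id": 3, "customer": "Bob", "model": "Civic", "cities": ["Chicago", "Madison"]},
]

def satisfies(world):
    """Constraint taken from the abstract's example:
    repairs with the same customer name and model took place in the same city."""
    seen = {}
    for fact, city in zip(facts, world):
        key = (fact["customer"], fact["model"])
        if seen.setdefault(key, city) != city:
            return False
    return True

def marginals():
    """Per-fact city distribution over valid worlds.
    Assumption: every valid world is equally likely."""
    worlds = [w for w in product(*(f["cities"] for f in facts)) if satisfies(w)]
    counts = [Counter() for _ in facts]
    for world in worlds:
        for i, city in enumerate(world):
            counts[i][city] += 1
    return [{c: n / len(worlds) for c, n in cnt.items()} for cnt in counts]

def expected_count(city):
    """Expected COUNT of repairs in `city`, computed from the marginals alone."""
    return sum(m.get(city, 0.0) for m in marginals())
```

In this toy instance the constraint forces fact 1 into Madison because fact 2, which has the same customer and model, can only be in Madison. Once the marginals are available, the expected COUNT is a sum over them with no further reference to individual worlds, which is the intuition behind answering aggregation queries from a database of fact marginals.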
