Incremental Detection of Inconsistencies in Distributed Data

This paper investigates incremental detection of errors in distributed data. Given a distributed database \(D\), a set \(\Sigma\) of conditional functional dependencies (CFDs), the set \(\mathsf{V}\) of violations of the CFDs in \(D\), and updates \(\Delta D\) to \(D\), the problem is to find, with minimum data shipment, the changes \(\Delta\mathsf{V}\) to \(\mathsf{V}\) in response to \(\Delta D\). The need for this study is evident, since real-life data is often dirty, distributed, and frequently updated, and it is often prohibitively expensive to recompute the entire set of violations when \(D\) is updated. We show that the incremental detection problem is NP-complete for a database \(D\) that is partitioned either vertically or horizontally, even when \(\Sigma\) and \(D\) are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of \(\Delta D\) and \(\Delta\mathsf{V}\), independent of the size of the database \(D\). We provide such incremental algorithms for vertically partitioned data and for horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.
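To make the notion of computing \(\Delta\mathsf{V}\) from \(\Delta D\) concrete, the following is a minimal single-site sketch in Python. It is not the paper's distributed algorithm and ignores data shipment entirely. It assumes one CFD of the form \(X \rightarrow B\) whose pattern places constants only on attributes of \(X\), and it treats a tuple as a violation when some other tuple matching the pattern agrees with it on \(X\) but differs on \(B\). The class name `IncrementalFDChecker`, its methods, and the attribute names in the usage note are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class IncrementalFDChecker:
    """Maintains violations of one dependency X -> B on a single site.

    Illustrative assumption: the CFD pattern only fixes constants on
    attributes of X; constant patterns on B are not handled.
    """

    def __init__(self, lhs, rhs, pattern=None):
        self.lhs = tuple(lhs)               # attributes X
        self.rhs = rhs                      # attribute B
        self.pattern = dict(pattern or {})  # required constants on X
        # groups[x_value][b_value] = ids of tuples with those values
        self.groups = defaultdict(lambda: defaultdict(set))
        self.rows = {}                      # tuple id -> row, tracked rows only

    def _applies(self, row):
        return all(row[a] == c for a, c in self.pattern.items())

    def _key(self, row):
        return tuple(row[a] for a in self.lhs)

    def insert(self, tid, row):
        """Insert a tuple; return (new_violations, resolved_violations)."""
        if not self._applies(row):
            return set(), set()
        group = self.groups[self._key(row)]
        b = row[self.rhs]
        distinct_before = len(group)
        is_new_value = b not in group
        group[b].add(tid)
        self.rows[tid] = row
        if distinct_before == 0 or (distinct_before == 1 and not is_new_value):
            return set(), set()             # group is still consistent
        if distinct_before == 1:
            # Group just became inconsistent: every member is a new violation.
            return set().union(*group.values()), set()
        return {tid}, set()                 # group was already inconsistent

    def delete(self, tid):
        """Delete a tuple; return (new_violations, resolved_violations)."""
        row = self.rows.pop(tid, None)
        if row is None:
            return set(), set()
        key = self._key(row)
        group = self.groups[key]
        b = row[self.rhs]
        distinct_before = len(group)
        group[b].discard(tid)
        if not group[b]:
            del group[b]
        if not group:
            del self.groups[key]
        if distinct_before < 2:
            return set(), set()             # the tuple was not a violation
        resolved = {tid}
        if len(group) == 1:
            # Only one B value remains: the rest of the group is now clean.
            resolved |= next(iter(group.values()))
        return set(), resolved
```

In this sketch each update touches only the group that shares the updated tuple's \(X\) value, and it enumerates that group only when all of its members enter or leave the violation set, so the work per update tracks the size of the update plus the change in violations rather than the size of the stored data. For example, with `lhs=["cc", "zip"]`, `rhs="street"`, and `pattern={"cc": "44"}`, inserting a second tuple with the same `zip` but a different `street` reports both tuples as new violations.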

[24]  Jianzhong Li,et al.  Incremental Detection of Inconsistencies in Distributed Data , 2012, IEEE Transactions on Knowledge and Data Engineering.
