Describing differences between databases

We study the novel problem of efficiently computing the update distance for a pair of relational databases. In analogy to the edit distance of strings, we define the update distance of two databases as the minimal number of set-oriented insert, delete and modification operations necessary to transform one database into the other. We show how this distance can be computed by traversing a search space of database instances connected by update operations. This insight leads to a family of algorithms that compute the update distance or approximations of it. In our experiments, we observed that a simple heuristic performs surprisingly well in most considered cases.

Our motivation for studying distance measures for databases stems from the field of scientific databases. There, replicas of a single database are often maintained at different sites, which typically leads to (accidental or planned) divergence of their content. To re-create a consistent view, these differences must be resolved. Such an effort requires an understanding of the process that produced them. We found that minimal update sequences of set-oriented update operations are a proper and concise representation of systematic errors, thus giving valuable clues to domain experts responsible for conflict resolution.
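The search-space view described above can be illustrated with a small sketch. It is not the authors' algorithm: it is a plain breadth-first search over instances of a single relation, under simplifying assumptions chosen here for illustration. A database is a set of equal-length tuples, an insert adds one tuple, and delete and modify each apply to all tuples selected by a single `column == value` condition. The names `neighbors`, `update_distance` and `max_depth` are hypothetical.

```python
from collections import deque
from itertools import product


def neighbors(db, values_per_col):
    """Yield every instance reachable from db by one operation (toy model)."""
    ncols = len(values_per_col)
    # INSERT: add one tuple built from the known values of each column.
    for row in product(*values_per_col):
        if row not in db:
            yield db | {row}
    # DELETE and MODIFY both use a selection of the form "column c == v".
    for c in range(ncols):
        for v in values_per_col[c]:
            selected = {r for r in db if r[c] == v}
            if not selected:
                continue
            # DELETE WHERE c = v (removes all selected tuples at once).
            yield db - selected
            # UPDATE SET t = new WHERE c = v (changes all selected tuples).
            for t in range(ncols):
                for new in values_per_col[t]:
                    changed = {r[:t] + (new,) + r[t + 1:] for r in selected}
                    result = (db - selected) | changed
                    if result != db:
                        yield result


def update_distance(source, target, max_depth=4):
    """Minimal number of operations from source to target, or None if the
    target is not reached within max_depth steps (assumed cut-off)."""
    source, target = frozenset(source), frozenset(target)
    if source == target:
        return 0
    ncols = len(next(iter(source | target)))
    # Assumption: only values occurring in either database are considered.
    values_per_col = [
        sorted({r[c] for r in source | target}) for c in range(ncols)
    ]
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        db, dist = queue.popleft()
        if db == target:
            return dist
        if dist == max_depth:
            continue
        for nxt in neighbors(db, values_per_col):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


replica_a = {("p1", "human"), ("p2", "human"), ("p3", "mouse")}
replica_b = {("p1", "Homo sapiens"), ("p2", "Homo sapiens"), ("p3", "mouse")}
print(update_distance(replica_a, replica_b))  # 1
```

In this example the two replicas differ in two tuples, yet a single set-oriented modification (set the second column to "Homo sapiens" wherever it is "human") transforms one into the other, which is the kind of systematic difference the update distance is meant to expose. Exhaustive search like this grows quickly with the number of tuples and values, so it only serves to make the definition concrete.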
