Describing differences between databases

We study the novel problem of efficiently computing the update distance for a pair of relational databases. In analogy to the edit distance of strings, we define the update distance of two databases as the minimal number of set-oriented insert, delete and modification operations necessary to transform one database into the other. We show how this distance can be computed by traversing a search space of database instances connected by update operations. This insight leads to a family of algorithms that compute the update distance or approximations of it. In our experiments we observed that a simple heuristic performs surprisingly well in most considered cases.

Our motivation for studying distance measures for databases stems from the field of scientific databases. There, replicas of a single database are often maintained at different sites, which typically leads to (accidental or planned) divergence of their content. To re-create a consistent view, these differences must be resolved. Such an effort requires an understanding of the process that produced them. We found that minimal update sequences of set-oriented update operations are a proper and concise representation of systematic errors, thus giving valuable clues to domain experts responsible for conflict resolution.
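The search-space view in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not spell out the exact operation model, so the code assumes one relation with a primary key, modifications of the form `SET col = value WHERE cond_col = cond_value`, deletions of the form `DELETE WHERE cond_col = cond_value`, and insertion of one target tuple at a time. The names `successors` and `update_distance` are hypothetical. A plain breadth-first search over database states then returns the minimal number of operations under these assumptions; it is exhaustive and only practical for tiny instances.

```python
from collections import deque


def successors(rows, target, arity):
    """Yield every state reachable from `rows` by ONE assumed update operation.

    `rows` and `target` map a primary key to a tuple of attribute values.
    """
    # Set-oriented modification: SET col = value WHERE cond_col = cond_value.
    # Candidate new values are restricted to those occurring in the target.
    for col in range(arity):
        new_values = {t[col] for t in target.values()}
        for cond_col in range(arity):
            for cond_value in {r[cond_col] for r in rows.values()}:
                for value in new_values:
                    yield {
                        k: (r[:col] + (value,) + r[col + 1:]
                            if r[cond_col] == cond_value else r)
                        for k, r in rows.items()
                    }
    # Set-oriented deletion: DELETE WHERE cond_col = cond_value.
    for cond_col in range(arity):
        for cond_value in {r[cond_col] for r in rows.values()}:
            yield {k: r for k, r in rows.items() if r[cond_col] != cond_value}
    # Insertion of one target tuple whose key is still missing.
    for k, t in target.items():
        if k not in rows:
            yield {**rows, k: t}


def update_distance(source, target):
    """Breadth-first search from `source` to `target`; returns the step count."""
    arity = len(next(iter((source or target).values()), ()))
    start = frozenset(source.items())
    goal = frozenset(target.items())
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for nxt in successors(dict(state), target, arity):
            frozen = frozenset(nxt.items())
            if frozen not in seen:
                seen.add(frozen)
                queue.append((frozen, dist + 1))
    return None  # not reached when both databases have at least one attribute


# Two replicas that differ by one systematic renaming.
source = {1: ("human", "x"), 2: ("human", "y"), 3: ("mouse", "z")}
target = {1: ("Homo sapiens", "x"), 2: ("Homo sapiens", "y"), 3: ("mouse", "z")}
print(update_distance(source, target))  # prints 1
```

In this example two tuples differ, yet a single set-oriented modification explains both changes, which is the kind of concise description of a systematic difference the abstract refers to.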

Ulf Leser | Heiko Müller | Johann-Christoph Freytag
