Paper information - Automatic Data Fusion with HumMer

Automatic Data Fusion with HumMer

Heterogeneous and dirty data is abundant. It is stored under different, often opaque schemata; it represents identical real-world objects multiple times, causing duplicates; and it has missing and conflicting values. The Humboldt Merger (HumMer) is a tool that allows ad-hoc, declarative fusion of such data using a simple extension to SQL. Guided by a query against multiple tables, HumMer proceeds in three fully automated steps: First, instance-based schema matching bridges the schematic heterogeneity of the tables by aligning corresponding attributes. Next, duplicate detection techniques find multiple representations of identical real-world objects. Finally, data fusion and conflict resolution merge the duplicates into a single, consistent, and clean representation.
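The three steps named in the abstract can be illustrated with a toy sketch. This is not HumMer's implementation or its SQL extension. It is a minimal Python illustration under stated assumptions: attributes are aligned by the overlap of their values, duplicates are found by string similarity on a single key attribute, and conflicts are resolved by per-attribute functions. All names here (`match_schema`, `detect_duplicates`, `fuse`, the sample tables, and the 0.8 threshold) are hypothetical.

```python
from difflib import SequenceMatcher

def match_schema(left, right):
    """Step 1 (assumed heuristic): map each attribute of `right` to the
    `left` attribute whose set of values overlaps most (Jaccard)."""
    def values(rows, attr):
        return {str(r[attr]).lower() for r in rows if r.get(attr) is not None}

    mapping = {}
    for r_attr in right[0]:
        r_vals = values(right, r_attr)

        def jaccard(l_attr):
            l_vals = values(left, l_attr)
            union = l_vals | r_vals
            return len(l_vals & r_vals) / len(union) if union else 0.0

        mapping[r_attr] = max(left[0], key=jaccard)
    return mapping

def detect_duplicates(rows, key, threshold=0.8):
    """Step 2 (assumed heuristic): greedily cluster rows whose key is
    similar enough to the first row of an existing cluster."""
    clusters = []
    for row in rows:
        for cluster in clusters:
            representative = cluster[0][key].lower()
            if SequenceMatcher(None, row[key].lower(), representative).ratio() >= threshold:
                cluster.append(row)
                break
        else:
            clusters.append([row])
    return clusters

def coalesce(values):
    """Default conflict resolution: first non-missing value."""
    return next((v for v in values if v is not None), None)

def max_non_null(values):
    """Alternative conflict resolution: largest non-missing value."""
    return max((v for v in values if v is not None), default=None)

def fuse(clusters, resolvers):
    """Step 3: merge each cluster into one record, attribute by attribute."""
    fused = []
    for cluster in clusters:
        record = {}
        for attr in cluster[0]:
            candidates = [row.get(attr) for row in cluster]
            record[attr] = resolvers.get(attr, coalesce)(candidates)
        fused.append(record)
    return fused

# Two hypothetical tables describing overlapping sets of people.
table_a = [
    {"name": "Ann Lee", "age": 24, "city": "Berlin"},
    {"name": "Bob Ray", "age": 31, "city": None},
    {"name": "Cy Moss", "age": 45, "city": "Bonn"},
]
table_b = [
    {"full_name": "Ann Lee", "years": 24, "town": "Berlin"},
    {"full_name": "Bob Rey", "years": 32, "town": "Hamburg"},
    {"full_name": "Dee Fox", "years": 52, "town": "Bonn"},
]

mapping = match_schema(table_a, table_b)
renamed_b = [{mapping[attr]: value for attr, value in row.items()} for row in table_b]
clusters = detect_duplicates(table_a + renamed_b, key="name")
fused = fuse(clusters, resolvers={"age": max_non_null})

for record in fused:
    print(record)
```

In this sketch the two "Bob" rows end up in one cluster even though the names differ by one letter, the missing city is filled from the other source, and the age conflict is settled by the `max_non_null` resolver. A real system would need more robust matching than value overlap and a single similarity threshold.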

[1] Felix Naumann, et al. Declarative Data Fusion - Syntax, Semantics, and Implementation, 2005, ADBIS.

[2] Jennifer Widom, et al. Trio: A System for Integrated Management of Data, Accuracy, and Lineage, 2004, CIDR.

[3] Felix Naumann, et al. Schema matching using duplicates, 2005, 21st International Conference on Data Engineering (ICDE'05).

[4] Michael Stonebraker, et al. THALIA: Test Harness for the Assessment of Legacy Information Integration Approaches, 2005, 21st International Conference on Data Engineering (ICDE'05).

[5] Felix Naumann, et al. DogmatiX tracks down duplicates in XML, 2005, SIGMOD '05.

[6] Bernhard Seeger, et al. XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, 2001, VLDB.

[7] Pradeep Ravikumar, et al. A Comparison of String Distance Metrics for Name-Matching Tasks, 2003, IIWeb.