Paper information - Sampling dirty data for matching attributes

Sampling dirty data for matching attributes

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often "dirty", especially when data from different sources is integrated. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between individual strings. To make the measures practical, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that, for dirty data, our measures estimate value overlap more accurately than existing sample-based methods, but we also observe a clear trade-off between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
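To make the idea of two-stage filtering over shared samples concrete, the following is a minimal sketch and not the paper's algorithm. The abstract does not define the measures, so everything here is an assumption made for illustration: the function names, the use of exact containment as the fast measure, the use of `difflib.SequenceMatcher` as the string-level similarity, and both thresholds.

```python
from difflib import SequenceMatcher


def exact_containment(a, b):
    """Fast measure (assumed): share of distinct values in a found verbatim in b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def soft_containment(a, b, threshold=0.8):
    """Slower measure (assumed): share of distinct values in a that have
    at least one sufficiently similar string in b."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    matched = sum(
        1
        for x in a
        if any(SequenceMatcher(None, x, y).ratio() >= threshold for y in b)
    )
    return matched / len(a)


def two_stage_overlap(a, b, min_exact=0.2, threshold=0.8):
    """Hypothetical two-stage filter over the same pair of samples.

    Stage 1 discards the pair (returns None) if the fast measure falls
    below min_exact; stage 2 computes the slower measure for survivors.
    """
    if exact_containment(a, b) < min_exact:
        return None
    return soft_containment(a, b, threshold)


# Two small samples of attribute values; "Chicgo" is a deliberate typo.
sample_a = ["Boston", "Chicago", "Seattle"]
sample_b = ["Boston", "Chicgo", "Denver"]

print(exact_containment(sample_a, sample_b))
print(soft_containment(sample_a, sample_b))
print(two_stage_overlap(sample_a, sample_b))
```

In this toy input the exact measure counts only "Boston", while the string-aware measure also pairs "Chicago" with the misspelled "Chicgo". The string-aware measure compares every value in one sample against every value in the other, which illustrates the kind of accuracy-versus-speed trade-off the abstract mentions.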
