Sampling dirty data for matching attributes

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when data from different sources is integrated. To deal with this issue, we propose new similarity measures between sets of strings that consider not only set-based similarity but also similarity between individual string instances. To make the measures practical, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that, for dirty data, our measures estimate value overlap more accurately than existing sample-based methods, but we also observe a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
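
The abstract does not define the proposed measures, so the following is only an illustrative sketch of the general idea, not the authors' method. It contrasts a purely set-based measure (exact Jaccard overlap) with a "soft" overlap that also counts values whose strings are merely similar. The function names, the use of padded q-gram Jaccard as the string similarity, and the threshold of 0.6 are all assumptions made for this example.

```python
def qgrams(s, q=2):
    """Return the set of padded, lower-cased q-grams of a string (assumed tokenization)."""
    s = s.lower()
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def string_similarity(a, b, q=2):
    """Jaccard similarity of the q-gram sets of two strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def exact_jaccard(xs, ys):
    """Set-based similarity only: values must match exactly."""
    xs, ys = set(xs), set(ys)
    if not xs and not ys:
        return 0.0
    return len(xs & ys) / len(xs | ys)


def soft_containment(xs, ys, threshold=0.6, q=2):
    """Fraction of values in xs that have a sufficiently similar value in ys.

    The threshold is a hypothetical tuning parameter, not taken from the paper.
    """
    xs, ys = set(xs), set(ys)
    if not xs:
        return 0.0
    matched = sum(
        1 for x in xs
        if any(string_similarity(x, y, q) >= threshold for y in ys)
    )
    return matched / len(xs)


# Two small value samples from hypothetical attributes with dirty data.
sample_a = ["New York", "Boston", "Chicago"]
sample_b = ["new york", "Bostn", "Seattle"]

print(exact_jaccard(sample_a, sample_b))     # exact matching finds no overlap
print(soft_containment(sample_a, sample_b))  # soft matching recovers two of three values
```

In a two-stage setup along the lines the abstract suggests, the cheaper exact measure could be run first on the samples, and the slower soft measure applied only to attribute pairs that the first stage cannot confidently accept or reject. The quadratic comparison in `soft_containment` is kept for clarity and would not scale to large samples.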
