Paper Information - Transformation-based Framework for Record Matching

Transformation-based Framework for Record Matching

Today's record matching infrastructure offers no flexible way to account for synonyms such as "Robert" and "Bob", which refer to the same name, or for more general string transformations such as abbreviations. We propose a programmatic framework for record matching that takes such user-defined string transformations as input. To the best of our knowledge, this is the first proposal for such a framework. The framework, while expressive, poses significant computational challenges, which we address. We empirically evaluate our techniques over real data.
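To make the idea of matching under user-defined transformations concrete, here is a minimal Python sketch. It is an illustration only, not the algorithm from the paper. It makes several assumptions the abstract does not state: rules rewrite single tokens, token-set Jaccard is the base similarity, and the score is the best one over all rewritten variants of the two strings. The names `TRANSFORMATIONS`, `variants`, and `transformed_similarity` are hypothetical.

```python
from itertools import product

# Hypothetical user-defined transformations: token -> alternative tokens.
TRANSFORMATIONS = {
    "bob": ["robert"],
    "st": ["street"],
    "corp": ["corporation"],
}


def tokenize(text):
    """Lowercase the string and split it on whitespace."""
    return text.lower().split()


def variants(text, rules):
    """Yield every token set obtained by keeping or rewriting each token."""
    options = [[tok] + rules.get(tok, []) for tok in tokenize(text)]
    for choice in product(*options):
        yield frozenset(choice)


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed base similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def transformed_similarity(left, right, rules):
    """Best base similarity over all pairs of rewritten variants."""
    return max(
        jaccard(a, b)
        for a in variants(left, rules)
        for b in variants(right, rules)
    )


if __name__ == "__main__":
    print(transformed_similarity("Bob Smith", "Robert Smith", TRANSFORMATIONS))
```

Without the rules, "Bob Smith" and "Robert Smith" share one of three distinct tokens. With the rule "bob" -> "robert" applied, the two token sets coincide. Note that this naive enumeration grows exponentially with the number of rewritable tokens, which hints at the kind of computational challenge the abstract mentions.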
