Integrating Feature Analysis and Background Knowledge to Recommend Similarity Functions

Existing approaches in similarity analysis is little concerned with the right choice of similarity functions. We present an approach for suggesting which similarity functions (e.g., edit distance) are most appropriate for a given similarity search task. We identify data features (e.g., misspellings) that are considerable when choosing similarity functions. We also introduce the concept of similarity function background knowledge that associates data features with similarity functions, and apply the knowledge to recommend suitable similarity functions.

[1]  Surajit Chaudhuri,et al.  Example-driven design of efficient record matching queries , 2007, VLDB.

[2]  Ahmed K. Elmagarmid,et al.  TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[3]  Guoliang Li,et al.  Fast-join: An efficient method for fuzzy token matching based string similarity join , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[4]  Panagiotis G. Ipeirotis,et al.  Duplicate Record Detection: A Survey , 2007 .

[5]  Patrick A. V. Hall,et al.  Approximate String Matching , 1994, Encyclopedia of Algorithms.

[6]  Peter Christen,et al.  A Comparison of Personal Name Matching: Techniques and Practical Issues , 2006, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06).

[7]  Jayant Madhavan,et al.  Reference reconciliation in complex information spaces , 2005, SIGMOD '05.

[8]  Craig A. Knoblock,et al.  Learning domain-independent string transformation weights for high accuracy object identification , 2002, KDD.

[9]  Felix Naumann,et al.  Efficient similarity search: arbitrary similarity measures, arbitrary composition , 2011, CIKM '11.

[10]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[11]  Pradeep Ravikumar,et al.  Adaptive Name Matching in Information Integration , 2003, IEEE Intell. Syst..

[12]  Hye-Young Paik,et al.  Similarity Function Recommender Service Using Incremental User Knowledge Acquisition , 2011, ICSOC.

[13]  Andreas Thor,et al.  Learning-Based Approaches for Matching Web Data Entities , 2010, IEEE Internet Computing.

[14]  Sigal Sahar,et al.  Interestingness via what is not interesting , 1999, KDD '99.