Automatic threshold estimation for data matching applications

Several advanced data management applications, such as data integration, data deduplication, and similarity querying, rely on similarity functions. A similarity function needs a threshold value to decide whether two data instances match, i.e., whether they represent the same real-world object. Choosing this threshold is therefore a central problem. In this paper, we propose a method for estimating the quality of a similarity function, measured as recall and precision computed at several thresholds. Based on these estimates and on the requirements of a specific application, a user can choose a threshold value suited to that application. The estimation process relies on a clustering phase performed on a sample drawn from the data collection and requires no human intervention.
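To make the idea of measuring recall and precision at several thresholds concrete, the following is a minimal sketch and not the estimation procedure of the paper. It assumes that cluster labels for a sample are already available and treats two records in the same cluster as a true match. The function name `precision_recall_by_threshold`, the token-based Jaccard similarity, and the toy records and cluster ids are all hypothetical choices made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an illustrative similarity function)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def precision_recall_by_threshold(
    records: Sequence[str],
    cluster_ids: Sequence[Hashable],
    similarity: Callable[[str, str], float],
    thresholds: Sequence[float],
) -> Dict[float, Tuple[float, float]]:
    """Return {threshold: (precision, recall)} over all record pairs.

    Assumption: a pair is a "true" match when both records share a
    cluster id. This stands in for the output of a clustering phase.
    """
    # Score every unordered pair once and note whether it is a true match.
    scored = [
        (similarity(records[i], records[j]), cluster_ids[i] == cluster_ids[j])
        for i, j in combinations(range(len(records)), 2)
    ]
    total_matches = sum(1 for _, same in scored if same)

    result: Dict[float, Tuple[float, float]] = {}
    for t in thresholds:
        # Pairs whose score reaches the threshold are predicted matches.
        predicted = [same for score, same in scored if score >= t]
        true_pos = sum(predicted)
        # Convention used here: an empty denominator yields 1.0.
        precision = true_pos / len(predicted) if predicted else 1.0
        recall = true_pos / total_matches if total_matches else 1.0
        result[t] = (precision, recall)
    return result


# Toy sample; the cluster ids are assumed, not computed.
records = [
    "john smith", "john smith jr", "jon smith",
    "mary jones", "mary ann jones", "mary smith",
]
cluster_ids = [0, 0, 0, 1, 1, 2]
result = precision_recall_by_threshold(
    records, cluster_ids, jaccard, thresholds=[0.2, 0.3, 0.5]
)
for threshold, (precision, recall) in result.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy sample, raising the threshold from 0.3 to 0.5 moves precision from 0.5 to 1.0 while recall falls from 0.75 to 0.5, which is the kind of trade-off a user would weigh against the requirements of the application.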
