Generalized Monge-Elkan Method for Approximate Text String Comparison

The Monge-Elkan method is a general text string comparison method that combines an internal character-based similarity measure (e.g. edit distance) with a token-level (i.e. word-level) similarity measure. We propose a generalization of this method that replaces the simple average in the Monge-Elkan expression with the generalized arithmetic mean. Experiments on 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
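To make the idea concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that the generalized mean takes the usual power-mean form with an exponent m, that strings are tokenized on whitespace, and that the internal measure is a length-normalized edit-distance similarity. The function names, the parameter m, and its default value are illustrative assumptions. With m = 1 the sketch reduces to the simple average of the original Monge-Elkan method.

```python
from typing import Callable


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """Edit distance normalized to a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


def generalized_monge_elkan(
    a: str,
    b: str,
    m: float = 2.0,  # hypothetical default; m = 1 gives the simple average
    sim: Callable[[str, str], float] = edit_similarity,
) -> float:
    """Power mean (exponent m) of each token's best match in the other string."""
    if m <= 0:
        raise ValueError("this sketch only handles m > 0")
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    # For every token of a, keep the similarity of its best-matching token in b.
    best = [max(sim(x, y) for y in tokens_b) for x in tokens_a]
    return (sum(score ** m for score in best) / len(best)) ** (1.0 / m)


if __name__ == "__main__":
    print(generalized_monge_elkan("ab cd", "ab cx", m=1.0))
    print(generalized_monge_elkan("ab cd", "ab cx", m=2.0))
```

Under these assumptions, exponents above 1 give more weight to well-matched token pairs, so the score never falls below the simple-average score for the same inputs. Note that the measure is asymmetric: swapping the two strings can change the result, because the mean is taken over the tokens of the first string only.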
