A Supervised Machine Learning Approach for Duplicate Detection over Gazetteer Records

This paper presents a novel approach for detecting duplicate records in digital gazetteers, using state-of-the-art machine learning techniques. It reports a thorough evaluation of alternative classifiers for the task of labeling pairs of gazetteer records as either duplicates or non-duplicates. The classifiers are built with support vector machines or alternating decision trees, using different combinations of similarity scores as feature vectors. Experimental results show that feature vectors combining multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, lead to an increase in accuracy. The paper also discusses how the proposed duplicate detection approach can scale to large collections through the use of filtering or blocking techniques.
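To make the idea of a pairwise feature vector concrete, the following is a minimal sketch and not the paper's implementation. The record layout (`name`, `type`, `lat`, `lon`), the function names, the use of `difflib` for name similarity, exact matching for place types, and the grid-cell blocking scheme are all assumptions introduced for illustration. The abstract does not specify which similarity measures or blocking method were used, and the sketch omits semantic relationships entirely.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def feature_vector(rec_a, rec_b):
    """Combine several similarity scores for one pair of records (hypothetical layout)."""
    return [
        name_similarity(rec_a["name"], rec_b["name"]),        # place-name similarity
        1.0 if rec_a["type"] == rec_b["type"] else 0.0,       # place-type agreement
        haversine_km(rec_a["lat"], rec_a["lon"],
                     rec_b["lat"], rec_b["lon"]),             # footprint distance (km)
    ]


def candidate_pairs(records, cell_deg=1.0):
    """Simple blocking: only pair records that fall in the same lat/lon grid cell.

    Simplification: near-duplicates that straddle a cell boundary are missed.
    """
    blocks = {}
    for idx, rec in enumerate(records):
        key = (math.floor(rec["lat"] / cell_deg), math.floor(rec["lon"] / cell_deg))
        blocks.setdefault(key, []).append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return sorted(pairs)


records = [
    {"name": "Lisbon", "type": "city", "lat": 38.7223, "lon": -9.1393},
    {"name": "Lisboa", "type": "city", "lat": 38.7169, "lon": -9.1399},
    {"name": "Porto", "type": "city", "lat": 41.1579, "lon": -8.6291},
]

for i, j in candidate_pairs(records):
    print(records[i]["name"], records[j]["name"], feature_vector(records[i], records[j]))
```

Each resulting vector would then be passed to a trained binary classifier, such as the support vector machines or alternating decision trees mentioned above, which labels the pair as duplicate or non-duplicate. The blocking step keeps the number of compared pairs well below the quadratic number of all possible pairs.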
