A Comparison of String Similarity Measures for Toponym Matching

The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, or typos are frequently similar, but not identical, to the gazetteer entries. String similarity measures can mitigate this problem, but because their performance is task-dependent, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, and compared 21 different measures on datasets containing romanized toponyms from 11 different countries. The best-performing measures varied widely across datasets but were highly consistent within each country and within each language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
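As a rough illustration of the kind of matching task described above, the following minimal sketch ranks gazetteer entries against a query using a normalized edit distance. It is not the authors' implementation and represents only one of many possible measures; the function names (`levenshtein`, `similarity`, `best_match`) and the tiny example gazetteer are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical after case folding."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query: str, gazetteer: list[str]) -> str:
    """Return the gazetteer entry most similar to the query (hypothetical helper)."""
    return max(gazetteer, key=lambda name: similarity(query, name))


if __name__ == "__main__":
    # Toy gazetteer for illustration only, not data from the study.
    names = ["Tehran", "Tabriz", "Shiraz"]
    print(best_match("Teheran", names))
```

Swapping `similarity` for another measure (for example, an n-gram overlap or a Jaro-Winkler score) leaves `best_match` unchanged, which is one simple way to compare measures on the same set of queries.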
