Context-Aware Approximate String Matching for Large-Scale Real-Time Entity Resolution

Techniques for approximate string matching have been widely studied over several decades. They are required in many applications, including entity resolution, spell checking, similarity joins, and biological sequence comparison. Most existing approximate string matching techniques used in entity resolution consider only the two strings being compared. They neglect contextual information such as how frequently strings occur in a database, the likelihood of the character edits between strings, or how many other similar strings a database contains. In this paper we investigate whether incorporating such contextual information into edit-distance-based approximate string matching can improve matching quality for real-time entity resolution. In this application, query records have to be matched in sub-second time to records in a large database that refer to the same entity. We evaluate our approach on two large real data sets and compare it to several baseline approaches. Our results show that considering edit frequency and the neighborhood size of a string can improve matching results, while taking string frequencies into account can actually make results worse.
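To make the idea of edit-frequency context concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a standard Levenshtein dynamic program in which only the substitution cost varies, and a hypothetical table `edit_counts` recording how often each character substitution was observed between known matching strings. The names `weighted_edit_distance`, `make_frequency_cost`, and the `floor` parameter are illustrative assumptions, as are the example counts.

```python
from collections import Counter


def weighted_edit_distance(s, t, sub_cost):
    """Levenshtein distance with unit insert/delete costs and a
    caller-supplied substitution cost function sub_cost(a, b)."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            cost = 0.0 if a == b else sub_cost(a, b)
            curr.append(min(prev[j] + 1.0,         # delete a from s
                            curr[j - 1] + 1.0,     # insert b into s
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def make_frequency_cost(edit_counts, floor=0.5):
    """Assumed weighting scheme: the most frequent substitution costs
    `floor`, an unseen substitution costs 1.0, others scale linearly."""
    top = max(edit_counts.values(), default=0)

    def sub_cost(a, b):
        if top == 0:
            return 1.0
        relative = edit_counts.get((a, b), 0) / top
        return 1.0 - (1.0 - floor) * relative

    return sub_cost


# Hypothetical substitution counts, invented for illustration only.
edit_counts = Counter({("i", "y"): 8, ("y", "i"): 8, ("c", "k"): 4})
context_cost = make_frequency_cost(edit_counts)

plain = weighted_edit_distance("smith", "smyth", lambda a, b: 1.0)
contextual = weighted_edit_distance("smith", "smyth", context_cost)
print(plain, contextual)
```

With these invented counts, the common substitution of "i" by "y" is penalized less than an unseen one, so "smith" and "smyth" end up closer under the contextual cost than under the plain edit distance. String frequency and neighborhood size, the other two kinds of context named above, would need access to the database and are not modeled in this sketch.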
