Approximate String Matching by End-Users using Active Learning

Identifying approximately identical strings is key to many data cleaning and data integration processes, including similarity joins and record matching. The accuracy of such tasks depends crucially on choosing string similarity measures and thresholds that suit the particular dataset. Selecting these manually is infeasible, and other approaches rely on adequate historical ground truth or massive manual effort. To address this problem, we propose an Active Learning algorithm that selects the best-performing similarity measure from a given set while optimizing a decision threshold. Active Learning minimizes the number of user queries needed to arrive at an appropriate classifier. Each query asks only for a match/no-match label, which end users can easily provide in their own domain. Evaluation on well-known string matching benchmark datasets shows that our approach achieves highly accurate results with only a small amount of manual labeling.
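As a rough illustration of the idea, and not the algorithm proposed in the paper, the following sketch picks one similarity measure and one threshold from a few match/no-match labels. Everything in it is an assumption made for the example: the three candidate measures, the use of accuracy on the labeled pairs as the selection criterion, the seeding with the highest- and lowest-scoring pair, the uncertainty-sampling query rule (ask about the pair whose score lies closest to the current threshold), and the names `MEASURES`, `best_threshold`, `fit`, and `active_select`.

```python
from difflib import SequenceMatcher

def edit_ratio(a, b):
    # Similarity in [0, 1] based on matching subsequences (standard library).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def bigram_jaccard(a, b):
    # Jaccard overlap of character bigrams.
    ga = {a.lower()[i:i + 2] for i in range(len(a) - 1)}
    gb = {b.lower()[i:i + 2] for i in range(len(b) - 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def token_jaccard(a, b):
    # Jaccard overlap of whitespace-separated tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Hypothetical candidate set; any functions returning scores in [0, 1] would do.
MEASURES = {"edit_ratio": edit_ratio,
            "bigram_jaccard": bigram_jaccard,
            "token_jaccard": token_jaccard}

def best_threshold(scores, labels):
    # Try thresholds between neighbouring scores, plus "all match" (0.0)
    # and "no match" (1.01); predict a match when score >= threshold.
    values = sorted(set(scores))
    candidates = [0.0] + [(lo + hi) / 2 for lo, hi in zip(values, values[1:])] + [1.01]
    def accuracy(t):
        return sum((s >= t) == y for s, y in zip(scores, labels)) / len(labels)
    return max(((t, accuracy(t)) for t in candidates), key=lambda p: p[1])

def fit(scores_by_measure, labeled):
    # Choose the (measure, threshold) with the highest accuracy on labeled pairs.
    best = None
    for name, scores in scores_by_measure.items():
        t, acc = best_threshold([scores[i] for i in labeled],
                                [labeled[i] for i in labeled])
        if best is None or acc > best[2]:
            best = (name, t, acc)
    return best

def active_select(pairs, oracle, budget=10):
    # oracle(a, b) stands in for the end user and returns True for a match.
    n = len(pairs)
    scores = {name: [f(a, b) for a, b in pairs] for name, f in MEASURES.items()}
    avg = [sum(s[i] for s in scores.values()) / len(scores) for i in range(n)]
    # Seed with the most and least similar pair (two labels, assumed heuristic).
    labeled = {}
    for i in (max(range(n), key=avg.__getitem__), min(range(n), key=avg.__getitem__)):
        labeled[i] = bool(oracle(*pairs[i]))
    while len(labeled) < min(budget, n):
        name, threshold, _ = fit(scores, labeled)
        # Uncertainty sampling: query the unlabeled pair nearest the threshold.
        unlabeled = [i for i in range(n) if i not in labeled]
        query = min(unlabeled, key=lambda i: abs(scores[name][i] - threshold))
        labeled[query] = bool(oracle(*pairs[query]))
    name, threshold, _ = fit(scores, labeled)
    return name, threshold, labeled
```

A call such as `active_select(pairs, ask_user, budget=10)` returns the chosen measure name, its threshold, and the labels collected; `ask_user` is a placeholder for whatever prompts the end user. Querying near the threshold is one common way to spend few labels where the classifier is least certain, but other query strategies and selection criteria (for example F1 instead of accuracy) are equally plausible.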
