Swash: A collective personal name matching framework

Abstract: A unique personal identifier is a prerequisite for person-centric analytical queries and data mining tasks such as fraud detection, expert finding, and credit scoring. Personal names are the most commonly used identifiers of individuals in datasets; however, a person's name may not be unique across a dataset's records, especially when the data are integrated from various sources. Intelligent systems use name matching methods to identify the different name representations of the same person. Previous name matching methods perform inadequately because they consider only name similarities and ignore dissimilarities. An important obstacle to considering dissimilarities is that Part of Name (PON) tags, e.g., first name and last name, are often unavailable. To address this issue, this paper proposes Swash, an unsupervised personal name matching framework. Swash models the information that can be gathered from a name dataset as a layered Heterogeneous Information Network, which facilitates control over the learning process. It predicts PON tags using a self-trainable algorithm and then collectively clusters the name vertices on the network. Evaluations on three public bibliographic datasets (CiteSeer, ArXiv, and DBLP) demonstrate the effectiveness of the proposed framework: Swash improves on the F1 score of the state-of-the-art method by up to 38%.
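
To illustrate why PON tags matter for dissimilarity, the following is a minimal sketch, not the Swash algorithm itself. It assumes the PON tags are already known and stored under the hypothetical keys "first" and "last", whereas Swash predicts them. The helper names `parts_compatible` and `names_conflict` are likewise illustrative assumptions. The sketch shows that once parts are tagged, two names that look very similar as strings can be ruled out because one tagged part disagrees.

```python
from typing import Optional


def _norm(part: Optional[str]) -> str:
    """Lowercase a name part and drop surrounding spaces and a trailing period."""
    return (part or "").strip().rstrip(".").lower()


def parts_compatible(a: Optional[str], b: Optional[str]) -> bool:
    """Return True if two name parts could belong to the same person.

    A missing part is compatible with anything, and a single-letter
    initial is compatible with any part starting with that letter.
    """
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return True
    if a == b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return False


def names_conflict(n1: dict, n2: dict) -> bool:
    """Return True if any tagged part (hypothetical keys) is incompatible."""
    return any(
        not parts_compatible(n1.get(tag), n2.get(tag))
        for tag in ("first", "last")
    )


if __name__ == "__main__":
    john = {"first": "John", "last": "Smith"}
    jane = {"first": "Jane", "last": "Smith"}
    # High string similarity, but the tagged first names disagree.
    print(names_conflict(john, jane))
    # An initial does not contradict the full first name.
    print(names_conflict({"first": "J.", "last": "Smith"}, jane))
```

Without the tags, a pure string-similarity measure would score "John Smith" and "Jane Smith" as close matches; with them, the mismatch in the first-name part can count as evidence against merging the two records.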
