Toward detection of aliases without string similarity

Entities commonly have aliases, and detecting them accurately is vital in many applications. It is especially critical to detect aliases that are deliberately used to hide real identities, such as those of terrorists and fraudsters. Most existing work pays little attention to aliases that have little or no string similarity to the given entities. In this paper, we propose an active-learning-based classifier for detecting this type of alias. To reduce the cost of pairwise comparison, a subset-based method restricts candidate selection to entity subsets. An active learning classifier is then applied within each entity subset to estimate the probability that a candidate is an alias of a given entity in that subset. After the classifier results from all subsets are integrated, a list of aliases is returned for each given entity. For evaluation, we implemented four state-of-the-art methods and compared them with our proposed approach on three datasets. The results show that the proposed active learning classifier outperforms these existing methods.
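The three stages described above (subset construction, per-subset active learning, and integration of results) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the features, the learner, or the query strategy, so everything below is an assumption made for illustration. The sketch assumes each entity comes with a set of context tokens, builds subsets from shared tokens, uses a tiny logistic regression with uncertainty sampling as the active learner, and integrates by taking the maximum score per pair. All names (`build_subsets`, `active_learn`, `detect_aliases`, the toy entities, and the `oracle` labeling function) are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def build_subsets(contexts):
    """Group entities sharing a context token, so pairs are compared only within a subset."""
    by_token = defaultdict(set)
    for entity, tokens in contexts.items():
        for tok in tokens:
            by_token[tok].add(entity)
    return [members for members in by_token.values() if len(members) > 1]

def features(a, b, contexts):
    """Bias term plus Jaccard overlap of contexts; the names themselves are never compared."""
    ca, cb = contexts[a], contexts[b]
    union = len(ca | cb)
    return [1.0, len(ca & cb) / union if union else 0.0]

def predict(w, x):
    """Logistic score, with the logit clamped to avoid overflow in exp."""
    z = max(-30.0, min(30.0, sum(wi * xi for wi, xi in zip(w, x))))
    return 1.0 / (1.0 + math.exp(-z))

def fit(samples, epochs=500, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in samples:
            err = predict(w, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def active_learn(pairs, contexts, oracle, budget=4):
    """Uncertainty sampling: label the pair scored closest to 0.5, then refit."""
    labeled, w, pool = [], [0.0, 0.0], list(pairs)
    for _ in range(min(budget, len(pool))):
        pool.sort(key=lambda p: abs(predict(w, features(*p, contexts)) - 0.5))
        pair = pool.pop(0)
        labeled.append((features(*pair, contexts), oracle(pair)))
        w = fit(labeled)
    return w

def detect_aliases(contexts, oracle, threshold=0.5):
    """Score pairs inside each subset, then integrate by keeping the best score per pair."""
    scores = {}
    for subset in build_subsets(contexts):
        pairs = list(combinations(sorted(subset), 2))
        w = active_learn(pairs, contexts, oracle)
        for pair in pairs:
            p = predict(w, features(*pair, contexts))
            scores[pair] = max(p, scores.get(pair, 0.0))
    aliases = defaultdict(list)
    for (a, b), p in scores.items():
        if p >= threshold:
            aliases[a].append(b)
            aliases[b].append(a)
    return dict(aliases)

# Toy data (invented): "J. Doe" and "Night Owl" share no characters but share all contexts.
contexts = {
    "J. Doe": {"account_17", "wire_transfer", "lisbon"},
    "Night Owl": {"account_17", "wire_transfer", "lisbon"},
    "Bob Smith": {"lisbon", "bakery"},
    "Alice Jones": {"bakery", "london"},
}

def oracle(pair):
    """Stand-in for a human annotator answering the learner's label queries."""
    return 1 if set(pair) == {"J. Doe", "Night Owl"} else 0

result = detect_aliases(contexts, oracle)
print(result)
```

On this toy data the only pair expected to pass the threshold is "J. Doe" and "Night Owl", even though the two names have no string similarity. Restricting comparisons to subsets means "Alice Jones" is never compared with "J. Doe" at all, which is the cost saving the subset step is meant to provide.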
