Entity disambiguation in anonymized graphs using graph kernels

This paper presents a novel method for entity disambiguation in anonymized graphs using local neighborhood structure. Most existing approaches leverage either node information, which might not be available in several contexts due to privacy concerns, or information about the sources of the data. We consider this problem in the supervised setting, where we are provided only with a base graph and a set of nodes labelled as ambiguous or unambiguous. We characterize the similarity between two nodes based on their local neighborhood structure using graph kernels, and solve the resulting classification task using SVMs. We give empirical evidence on two real-world datasets, comparing our approach to a state-of-the-art method and highlighting its advantages. We show that, despite using less information, our method is significantly better in terms of speed, accuracy, or both. We also present extensions of two existing graph kernels, namely the direct product kernel and the shortest-path kernel, with significant improvements in accuracy. For the direct product kernel, our extension also provides significant computational benefits. Moreover, we design and implement the algorithms of our method to work in a distributed fashion using the GraphLab framework, ensuring high scalability.
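As a rough illustration of comparing two nodes through their local neighborhoods, the sketch below extracts the k-hop neighborhood of each node and evaluates a truncated direct product (common-walk) kernel between the two subgraphs. It is a minimal sketch of the textbook kernel on unlabelled graphs, not the extension proposed in the paper. The names `ego_subgraph` and `walk_kernel`, the parameters `hops`, `max_len`, and `decay`, and the toy graph are all assumptions made for illustration.

```python
from collections import deque


def ego_subgraph(adj, root, hops):
    """Return the induced subgraph on all nodes within `hops` of `root`.

    `adj` maps each node to a list of neighbours (undirected graph).
    The result uses the same adjacency-dict representation.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if depth[u] == hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return {u: [v for v in adj[u] if v in depth] for u in depth}


def walk_kernel(g1, g2, max_len=3, decay=0.1):
    """Truncated direct product kernel (illustrative, unlabelled graphs).

    Sums decay**l times the number of length-l walks in the direct
    product graph of g1 and g2, for l = 0 .. max_len.
    """
    # counts[(a, b)] = number of common walks of the current length
    # that end at node a in g1 and node b in g2.
    counts = {(a, b): 1.0 for a in g1 for b in g2}
    total = float(len(counts))  # walks of length 0
    weight = 1.0
    for _ in range(max_len):
        weight *= decay
        nxt = dict.fromkeys(counts, 0.0)
        for (a, b), c in counts.items():
            for a2 in g1[a]:
                for b2 in g2[b]:
                    nxt[(a2, b2)] += c
        counts = nxt
        total += weight * sum(counts.values())
    return total


# Hypothetical toy graph: a triangle (0, 1, 2) with a tail 2-3-4.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
g_a = ego_subgraph(adj, 0, hops=1)  # the triangle
g_b = ego_subgraph(adj, 4, hops=1)  # a single edge
k_ab = walk_kernel(g_a, g_b, max_len=2, decay=0.5)
```

Evaluating such a kernel for every pair of labelled nodes gives a Gram matrix, which could then be passed to any kernel SVM that accepts precomputed kernels (for example, scikit-learn's `SVC(kernel="precomputed")`).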
