HBGSim: A structural similarity measurement over heterogeneous big graphs

Similarity measurement is fundamental to many data mining and information retrieval tasks, such as link prediction and relevance-based search. Conventional similarity measures rely mainly on homogeneous link relations and content information. As heterogeneous graphs gain popularity, however, these measures cannot take full advantage of the data structure. Their scalability is also challenged by the continuing growth of real-world big data. In this paper, we propose a new similarity measurement, HBGSim, based on heterogeneous structured data. HBGSim combines local and global features in a two-stage process. We compare our measurement with several traditional methods on the DBLP dataset, and the experimental results show that our method outperforms the others.
