Distant Meta-Path Similarities for Text-Based Heterogeneous Information Networks

Measuring network similarity is a fundamental data mining problem. Mainstream similarity measures mainly leverage structural information about the entities in the network without considering the network semantics. In the real world, heterogeneous information networks (HINs) with rich semantics are ubiquitous. However, existing network similarity measures do not generalize well to HINs because they fail to capture HIN semantics. The meta-path has been proposed and demonstrated to be an effective way to represent semantics in HINs. Accordingly, the original meta-path based similarities (e.g., PathSim and KnowSim) have been successful in computing entity proximity in HINs. The intuition is that the more instances of meta-path(s) there are between two entities, the more similar the entities are. Thus, the original meta-path similarity applies only to computing the proximity of two neighboring (connected) entities. In this paper, we propose the distant meta-path similarity, which is able to capture HIN semantics between two distant (isolated) entities and so provide more meaningful entity proximity. The main idea is that even if the two entities share no neighborhood entities (i.e., no meta-path instances connect them), the more similar their neighborhood entities are, the more similar the two entities should be. We then find the optimal distant meta-path similarity by exploring the similarity hypothesis space based on different theoretical foundations. We show the state-of-the-art similarity performance of the distant meta-path similarity on two text-based HINs and make the datasets publicly available.
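The contrast between the two ideas can be illustrated with a small sketch. It is not the formulation used in this paper: the toy document-word matrix `W`, the Document-Word-Document meta-path, and the helper `neighborhood_cosine` are all assumptions introduced for illustration. `pathsim` follows the published PathSim definition for a symmetric meta-path, 2·M[x][y] / (M[x][x] + M[y][y]), where M counts meta-path instances. The cosine over rows of M is only one possible stand-in for "comparing the neighborhoods" of two entities that have no connecting path instance.

```python
import math

# Hypothetical toy data: rows are documents, columns are words.
W = [
    [1, 1, 0, 0],  # d0 contains w0, w1
    [0, 1, 1, 0],  # d1 contains w1, w2
    [0, 0, 1, 1],  # d2 contains w2, w3
]


def commuting_matrix(w):
    """Count Document-Word-Document path instances: M = W * W^T."""
    n = len(w)
    return [
        [sum(a * b for a, b in zip(w[i], w[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim for a symmetric meta-path with commuting matrix m."""
    denom = m[x][x] + m[y][y]
    return 2 * m[x][y] / denom if denom else 0.0


def neighborhood_cosine(m, x, y):
    """Illustrative assumption: cosine similarity between the
    meta-path count vectors (rows of m) of entities x and y."""
    dot = sum(a * b for a, b in zip(m[x], m[y]))
    norm = math.sqrt(sum(a * a for a in m[x])) * math.sqrt(sum(b * b for b in m[y]))
    return dot / norm if norm else 0.0


M = commuting_matrix(W)
print(pathsim(M, 0, 1))              # d0 and d1 share a word
print(pathsim(M, 0, 2))              # d0 and d2 share no word
print(neighborhood_cosine(M, 0, 2))  # both are connected to d1
```

In this toy network d0 and d2 share no word, so their PathSim is zero, yet both are linked to d1, so the row-vector comparison gives a positive score (about 0.2). That gap is what a similarity between distant entities is meant to close.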
