SimFusion: measuring similarity using unified relationship matrix

In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, and can thus improve the quality of information applications that require combining information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content-based algorithms and the cutting-edge SimRank algorithm.
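
The iterative idea described above can be illustrated with a minimal sketch. It assumes an update rule of the form S ← L·S·Lᵀ, where L is a row-normalized URM and S starts as the identity matrix. This rule, the toy four-object matrix, and all function names are illustrative assumptions, not details stated in this abstract.

```python
# Minimal sketch of iterative similarity computation over a relationship matrix.
# ASSUMPTION: the update S <- L * S * L^T with a row-normalized matrix L and
# S initialized to the identity. The abstract does not spell out the rule.

def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows stay as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def simfusion_sketch(urm, iterations=10):
    """Return a similarity matrix after the given number of update steps."""
    n = len(urm)
    norm = row_normalize(urm)
    norm_t = transpose(norm)
    # Start from the identity: each object is similar only to itself.
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        sim = matmul(matmul(norm, sim), norm_t)
    return sim


# Hypothetical toy data: objects are [query1, query2, page1, page2].
# Both queries led to clicks on page1; page2 is unrelated to everything else.
URM = [
    [1, 0, 1, 0],  # query1: itself, page1
    [0, 1, 1, 0],  # query2: itself, page1
    [1, 1, 1, 0],  # page1: query1, query2, itself
    [0, 0, 0, 1],  # page2: itself only
]

if __name__ == "__main__":
    sim = simfusion_sketch(URM, iterations=1)
    print(round(sim[0][1], 4))  # prints 0.25
```

In this toy matrix the two queries have no direct relationship, yet one update step already gives them a nonzero similarity through the page they share, while the isolated page stays at zero similarity to both. That is the kind of latent relationship the abstract refers to.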
