SimFusion: A Unified Similarity Measurement Algorithm for Multi-Type Interrelated Web Objects

In this paper, we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous web objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlink and user click-through relationships). We claim that iterative computations over the URM can help overcome the data sparseness problem, which is common on the Web, and detect latent relationships among heterogeneous web objects, and can thus improve the quality of information applications that must combine information from heterogeneous sources. To support this claim, we propose a unified similarity-calculating algorithm, the SimFusion algorithm. By iteratively computing over the URM, SimFusion effectively integrates relationships from heterogeneous sources when measuring the similarity of two web objects. Experiments on a real search engine query log and a large real web page collection demonstrate that SimFusion significantly improves similarity measurement of web objects over both traditional content-based similarity-calculating algorithms and the cutting-edge SimRank algorithm.
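To make the idea of iterating over the URM concrete, the following is a minimal sketch, not the authors' implementation. It assumes a similarity-reinforcement update of the form S ← L·S·Lᵀ over a row-normalized URM L, starting from the identity matrix. The update rule, the diagonal reset, the equal relationship weights, the function names, and the toy click-through data are all assumptions made for illustration; none of them is stated in the abstract.

```python
# Hypothetical sketch of iterative similarity fusion over a Unified
# Relationship Matrix (URM). Assumed update rule: S <- L * S * L^T.

def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def fuse_similarity(urm, iterations):
    """Return the similarity matrix after the given number of updates."""
    l = row_normalize(urm)
    lt = transpose(l)
    n = len(urm)
    # Start from the identity: each object is similar only to itself.
    s = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        s = matmul(matmul(l, s), lt)
        # Assumption: keep self-similarity fixed at 1 after every update.
        for i in range(n):
            s[i][i] = 1.0
    return s

# Toy URM over five objects (made-up data): queries q0, q1, q2 and
# pages p0, p1. A 1 marks a click-through relationship between a query
# and a page. q0 clicked p0, q1 clicked p0 and p1, q2 clicked p1.
#        q0 q1 q2 p0 p1
URM = [
    [0, 0, 0, 1, 0],  # q0
    [0, 0, 0, 1, 1],  # q1
    [0, 0, 0, 0, 1],  # q2
    [1, 1, 0, 0, 0],  # p0
    [0, 1, 1, 0, 0],  # p1
]

after_one = fuse_similarity(URM, 1)
after_two = fuse_similarity(URM, 2)
print("sim(q0, q2) after 1 iteration:", after_one[0][2])
print("sim(q0, q2) after 2 iterations:", after_two[0][2])
```

In this toy example q0 and q2 share no clicked page, so their similarity is zero after one update. After the second update it becomes positive, because p0 and p1 have themselves become similar through the shared query q1. This is the kind of latent relationship that a single pass over sparse data would miss.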
