HITS on the web: how does it compare?

This paper describes a large-scale evaluation of the effectiveness of HITS in comparison with other link-based ranking algorithms, when used in combination with a state-of-the-art text retrieval algorithm exploiting anchor text. We quantified their effectiveness using three common performance measures: the mean reciprocal rank, the mean average precision, and the normalized discounted cumulative gain measurements. The evaluation is based on two large data sets: a breadth-first search crawl of 463 million web pages containing 17.6 billion hyperlinks and referencing 2.9 billion distinct URLs; and a set of 28,043 queries sampled from a query log, each query having on average 2,383 results, about 17 of which were labeled by judges. We found that HITS outperforms PageRank, but is about as effective as web-page in-degree. The same holds true when any of the link-based features are combined with the text retrieval algorithm. Finally, we studied the relationship between query specificity and the effectiveness of selected features, and found that link-based features perform better for general queries, whereas BM25F performs better for specific queries.
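
The three performance measures named above can be illustrated with a short sketch. This is not the paper's evaluation code. It assumes each query's results are given as a list of graded relevance labels in ranked order (0 meaning not relevant), that average precision is normalized by the number of relevant results present in the list, and that NDCG uses the common gain `2^rel - 1` with a `log2` discount and a cutoff `k`. The function names and the cutoff default are hypothetical.

```python
import math

def reciprocal_rank(labels):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(labels, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def average_precision(labels):
    """Mean of precision@rank taken at each relevant result.

    Assumption: normalizes by the relevant results found in `labels`,
    not by all relevant documents in the collection.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def dcg(labels, k):
    """Discounted cumulative gain over the top k results."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(labels[:k], start=1)
    )

def ndcg(labels, k=10):
    """DCG divided by the DCG of the ideal (descending-label) ordering."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0

def mean_over_queries(metric, rankings):
    """Average a per-query metric over a list of ranked label lists."""
    return sum(metric(labels) for labels in rankings) / len(rankings)
```

Calling `mean_over_queries(reciprocal_rank, rankings)` gives the mean reciprocal rank, and passing `average_precision` or `ndcg` instead gives the other two measures over the same set of queries.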
