Improvement of HITS-based algorithms on web documents

In this paper, we present two ways to improve the precision of HITS-based algorithms on Web documents. First, after analyzing the limitations of current HITS-based algorithms, we propose a new weighted HITS-based method that assigns appropriate weights to the in-links of root documents. Second, we combine content analysis with HITS-based algorithms and study the effects of four representative relevance scoring methods (VSM, Okapi, TLS, and CDR) on a set of broad-topic queries. Our experimental results show that our weighted HITS-based method performs significantly better than Bharat's improved HITS algorithm. When either our weighted HITS-based method or Bharat's HITS algorithm is combined with any of the four relevance scoring methods, the combined methods are only marginally better than our weighted HITS-based method alone. Among the four relevance scoring methods, there is no significant difference in quality when they are combined with a HITS-based algorithm.
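To make the idea of weighting in-links concrete, the following is a minimal sketch of a generic HITS iteration over weighted links. It is not the weighting scheme proposed in this paper, which the abstract does not specify: the function names `weighted_hits` and `normalize`, the edge-weight dictionary, and the example weights are all assumptions introduced for illustration only.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return scores
    return {k: v / norm for k, v in scores.items()}


def weighted_hits(edges, iterations=50):
    """Run HITS on a weighted link graph.

    edges maps (source, target) -> link weight. The weights here are
    hypothetical placeholders, not the weights defined in the paper.
    Returns (authority, hub) score dictionaries.
    """
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores over in-links.
        new_auth = {n: 0.0 for n in nodes}
        for (src, dst), w in edges.items():
            new_auth[dst] += w * hub[src]
        # Hub update: weighted sum of the new authority scores over out-links.
        new_hub = {n: 0.0 for n in nodes}
        for (src, dst), w in edges.items():
            new_hub[src] += w * new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Uniform weights: "c" has two in-links, "d" has one.
uniform = {("a", "c"): 1.0, ("b", "c"): 1.0, ("a", "d"): 1.0}
auth_uniform, hub_uniform = weighted_hits(uniform)

# Same graph, but the in-link of "d" is given a larger (assumed) weight.
reweighted = {("a", "c"): 1.0, ("b", "c"): 1.0, ("a", "d"): 3.0}
auth_reweighted, hub_reweighted = weighted_hits(reweighted)
```

With uniform weights the iteration reduces to plain HITS, and the page with more in-links ranks higher as an authority. Raising the weight of a single in-link is enough to reverse that ordering, which is why the choice of in-link weights matters for precision.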
