Improving web search ranking by incorporating summarization

Although link-analysis-based page ranking approaches have achieved great success in commercial search engines (SEs), content-based relevance computation still plays a very important role in ranking information retrieval results. Since most existing relevance computation algorithms run on the full text of a web page, this paper focuses on computing the relevance between a user's query and an automatically generated summary of each web page. The first part of the paper briefly reviews the state of the art of relevance computation in search engines; the inference network approach receives particular attention because it is the baseline method in our experimental SE system. We then introduce an automatic text summarization method based on multi-source integration, and replace the full text of each web page with its auto-generated abstract when computing the relevance between the page and the user query. To evaluate the effect of this condensed representation on relevance-based page ranking, the last part of the paper reports several experiments, covering the method described above at different compression ratios as well as full-text-based ranking. In addition to an efficiency gain for the SE system, the experimental results show that ranking on summaries produced by our text summarization system at a 30% compression ratio also yields an 11.29% precision improvement for the SE system.
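The pipeline above can be sketched in miniature. The fragment below is a minimal illustration, not the paper's multi-source summarizer or its inference-network model: it uses a simple Luhn-style frequency-based extractive summarizer controlled by a compression ratio, and a plain term-frequency overlap score as a stand-in for the relevance computation; all function names are hypothetical.

```python
import re
from collections import Counter

def summarize(text, ratio=0.3):
    """Luhn-style extractive summary (illustrative stand-in for the paper's
    multi-source summarizer): keep the highest-scoring sentences until
    roughly `ratio` of the document's words remain."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'\w+', text.lower())
    freq = Counter(words)

    def score(sentence):
        toks = re.findall(r'\w+', sentence.lower())
        # average term frequency of the sentence's words
        return sum(freq[t] for t in toks) / (len(toks) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    budget = ratio * len(words)
    kept, used = set(), 0
    for i in ranked:
        n = len(re.findall(r'\w+', sentences[i]))
        if used + n > budget and kept:
            break
        kept.add(i)
        used += n
    # preserve original sentence order in the output
    return ' '.join(sentences[i] for i in sorted(kept))

def relevance(query, doc_text):
    """Simple term-frequency overlap between query and document text
    (a stand-in for the inference-network relevance model)."""
    q_terms = re.findall(r'\w+', query.lower())
    d = Counter(re.findall(r'\w+', doc_text.lower()))
    total = sum(d.values()) or 1
    return sum(d[t] / total for t in q_terms)

# Ranking on summaries instead of full text:
# scores = [relevance(query, summarize(page, 0.3)) for page in pages]
```

Because the summary is a fraction of the page, scoring it is correspondingly cheaper, which is the source of the efficiency gain the paper reports; the precision effect depends on how well the summarizer preserves query-relevant content.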
