An evolutionary approach for combining different sources of evidence in search engines

Modern Web search engines use several strategies to improve the quality of their document rankings. Usually, the strategy adopted combines multiple sources of relevance evidence into a single ranking. This work proposes the use of evolutionary techniques to derive good evidence combination functions from three sources of relevance evidence: the textual content of documents, the reputation of documents as derived from the connectivity information available in the processed collection, and the concatenated anchor text. The combination functions discovered by our evolutionary strategies were tested on a collection of over 12 million documents, using 368 queries extracted from the query log of a real nationwide search engine. The experiments indicate that our proposal is an effective and practical alternative for combining sources of evidence into a single ranking. We also show that different types of queries submitted to a search engine can require different combination functions, and that our proposal is useful for coping with such differences.
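The idea of evolving a combination function can be illustrated with a minimal sketch. It is not the method evaluated in this work: it swaps in a plain genetic algorithm over the weights of a linear combination, whereas the evolutionary strategies described here may search a richer space of functions. The toy scores in `TRAINING`, the use of mean average precision as the fitness measure, and all function names and parameters are assumptions made for illustration only.

```python
import random

# Hypothetical toy data: for each query, a list of documents given as
# (content_score, reputation_score, anchor_score, is_relevant).
TRAINING = [
    [(0.9, 0.1, 0.2, 0), (0.4, 0.8, 0.9, 1), (0.3, 0.2, 0.1, 0), (0.6, 0.5, 0.7, 1)],
    [(0.8, 0.3, 0.0, 1), (0.7, 0.9, 0.1, 0), (0.2, 0.1, 0.6, 0), (0.5, 0.2, 0.4, 1)],
]


def combine(weights, doc):
    """Linear combination of the three evidence scores (an assumed form)."""
    return sum(w * s for w, s in zip(weights, doc[:3]))


def average_precision(weights, docs):
    """Average precision of the ranking induced by the combined score."""
    ranked = sorted(docs, key=lambda d: combine(weights, d), reverse=True)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc[3]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def fitness(weights, queries=TRAINING):
    """Mean average precision over the training queries."""
    return sum(average_precision(weights, q) for q in queries) / len(queries)


def evolve(generations=30, pop_size=20, seed=0):
    """Evolve a weight vector with selection, crossover, and mutation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(3)
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.1))  # mutate one weight
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print("weights:", best, "fitness:", fitness(best))
```

Because the better half of each generation survives unchanged, the best fitness never decreases from one generation to the next. To handle different query types, one could in principle run the same search separately on the training queries of each type, yielding one combination function per type.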
