Automatic ranking of information retrieval systems using data fusion

Measuring the effectiveness of information retrieval (IR) systems is essential for research and development and for monitoring search quality in dynamic environments. In this study, we introduce new methods for the automatic ranking of retrieval systems. These methods merge the retrieval results of multiple systems using various data fusion algorithms, treat the top-ranked documents in the merged result as the "(pseudo) relevant documents," and use these documents to evaluate and rank the systems. Experiments with Text REtrieval Conference (TREC) data yield statistically significant, strong correlations with human-based assessments of the same systems. We hypothesize that selecting systems that return documents different from those of the majority could exclude ordinary systems from data fusion, provide better discrimination among documents and systems, and thereby improve the effectiveness of automatic ranking. Based on this intuition, we introduce a new method for selecting the systems to be used in data fusion: we adopt the bias concept, which measures a system's deviation from the norm or majority, and include the systems with higher bias in the data fusion process. This approach yields even higher correlations with the human-based results, and we demonstrate that it outperforms previously proposed automatic ranking methods.

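As a rough illustration of the pipeline described above, the following Python sketch merges the systems' ranked lists with a simple Borda-count-style rule (one stand-in for the various data fusion algorithms mentioned in the abstract), treats the top of the fused list as the pseudo-relevant set, scores each system against that set with precision, and optionally restricts the fusion to the systems that deviate most from the fused "norm" as a crude bias proxy. Everything here (the fusion rule, the overlap-based bias measure, the precision cutoff, and names such as borda_fuse and rank_systems) is an illustrative assumption rather than the paper's actual algorithm or code.

    from collections import defaultdict
    from typing import Dict, List, Set, Tuple

    def borda_fuse(runs: Dict[str, List[str]], pool_depth: int = 100) -> List[str]:
        """Merge the systems' ranked lists with a simple Borda-count-style rule."""
        scores: Dict[str, float] = defaultdict(float)
        for ranking in runs.values():
            for rank, doc in enumerate(ranking[:pool_depth]):
                scores[doc] += pool_depth - rank  # earlier rank -> larger contribution
        return sorted(scores, key=scores.get, reverse=True)

    def bias_score(ranking: List[str], fused: List[str], n: int = 50) -> float:
        """Crude bias proxy: how little a system's top-n overlaps the fused 'norm'."""
        return 1.0 - len(set(ranking[:n]) & set(fused[:n])) / n

    def select_biased(runs: Dict[str, List[str]], keep: int = 6) -> Dict[str, List[str]]:
        """Keep only the `keep` systems that deviate most from the initial fused result."""
        fused = borda_fuse(runs)
        ordered = sorted(runs, key=lambda s: bias_score(runs[s], fused), reverse=True)
        return {s: runs[s] for s in ordered[:keep]}

    def precision_at(ranking: List[str], relevant: Set[str], n: int = 20) -> float:
        """Precision of one system's top-n against the pseudo-relevant set."""
        top = ranking[:n]
        return sum(doc in relevant for doc in top) / n if top else 0.0

    def rank_systems(per_query_runs: List[Dict[str, List[str]]],
                     use_bias: bool = False, k: int = 20) -> List[Tuple[str, float]]:
        """Score every system by mean precision against per-query pseudo-qrels."""
        totals: Dict[str, float] = defaultdict(float)
        for runs in per_query_runs:
            fusion_input = select_biased(runs) if use_bias else runs
            pseudo_qrels = set(borda_fuse(fusion_input)[:k])  # top-k of the merged list
            for system, ranking in runs.items():
                totals[system] += precision_at(ranking, pseudo_qrels, n=k)
        n_queries = len(per_query_runs)
        return sorted(((s, t / n_queries) for s, t in totals.items()),
                      key=lambda item: item[1], reverse=True)

    # Toy usage: two queries, three systems, each returning a ranked list of doc IDs.
    runs_q1 = {"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d1"], "sysC": ["d5", "d2", "d6"]}
    runs_q2 = {"sysA": ["d7", "d8"], "sysB": ["d8", "d9"], "sysC": ["d9", "d7"]}
    print(rank_systems([runs_q1, runs_q2], use_bias=False, k=2))

In this sketch the bias selection simply drops the systems whose top results most resemble the fused majority before the pseudo-relevant set is built; the paper's actual bias measure and fusion algorithms may differ in detail.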