Incorporating robustness into web ranking evaluation

In many Web search engines, a ranking function is selected for deployment mainly by comparing relevance measurements across candidates. Due to the dynamic nature of the Web, the ranking features, as well as the query and URL distributions on which ranking functions are built, may change dramatically over time. The relevance actually delivered by a deployed function may therefore degrade, invalidating the conclusions of the earlier selection. In this work we propose selecting Web ranking functions according to both their relevance and their robustness to the changes that can cause relevance degradation over time. We argue that ranking robustness can be effectively measured by taking the distribution of ranking scores across search results into account. We then propose two alternatives to the NDCG metric, both of which incorporate ranking robustness into ranking function evaluation and selection. Finally, we develop a machine learning approach that learns the parameters controlling the metric's sensitivity to score turbulence from human-judged preference data.
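The abstract does not spell out the proposed metrics, so the following Python sketch is only illustrative of the general idea: it assumes a hypothetical margin-based stability discount on per-position NDCG gains (small score gaps between adjacent results mean a minor score perturbation could swap the pair) and a single sensitivity parameter `alpha` standing in for the learned parameters. The names `robust_ndcg` and `fit_alpha` are invented for this sketch, not the paper's definitions.

```python
import math

def ndcg(relevances, k=10):
    """Standard NDCG@k over graded relevance labels, given in ranked order."""
    def dcg(rels):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def robust_ndcg(relevances, scores, alpha=1.0, k=10):
    """Illustrative robustness-aware NDCG (assumption, not the paper's metric):
    discount each position's gain by the score margin to the next result.
    A small margin means the pair is easily swapped by score turbulence,
    so its gain is down-weighted; alpha controls the sensitivity."""
    def gain(r):
        return 2**r - 1
    dcg = 0.0
    for i, r in enumerate(relevances[:k]):
        margin = scores[i] - scores[i + 1] if i + 1 < len(scores) else float("inf")
        stability = 1.0 - math.exp(-alpha * max(margin, 0.0))  # in [0, 1)
        dcg += stability * gain(r) / math.log2(i + 2)
    ideal = sum(gain(r) / math.log2(i + 2)
                for i, r in enumerate(sorted(relevances, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

def fit_alpha(judged_pairs, alphas=(0.1, 0.5, 1.0, 2.0, 5.0), k=10):
    """Toy stand-in for the learning step: pick the alpha whose metric agrees
    most often with human preferences between two result lists. Each pair is
    ((rels_a, scores_a), (rels_b, scores_b), preferred), preferred in {'a','b'}."""
    def agreement(alpha):
        hits = 0
        for (ra, sa), (rb, sb), pref in judged_pairs:
            better = "a" if robust_ndcg(ra, sa, alpha, k) >= robust_ndcg(rb, sb, alpha, k) else "b"
            hits += (better == pref)
        return hits
    return max(alphas, key=agreement)

# Two rankings with identical relevance but different score separation:
rels = [3, 2, 1, 0]
well_separated = [5.0, 3.0, 1.5, 0.2]
barely_separated = [5.0, 4.99, 4.98, 4.97]
print(ndcg(rels))                                   # plain NDCG: same for both
print(robust_ndcg(rels, well_separated))            # close to plain NDCG
print(robust_ndcg(rels, barely_separated))          # penalized: fragile ordering
```

As the example shows, plain NDCG is identical for the two rankings, while the robustness-aware variant penalizes the one whose ordering rests on tiny score margins; the learning step then tunes how strongly such fragility is penalized.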