Explicit web search result diversification

Queries submitted to a web search engine are typically short and often ambiguous. Given the enormous size of the Web, a misunderstanding of the information need underlying an ambiguous query can misguide the search engine, ultimately leading the user to abandon the originally submitted query. In order to overcome this problem, a sensible approach is to diversify the documents retrieved for the user's query. As a result, the likelihood that at least one of these documents will satisfy the user's actual information need is increased. In this thesis, we argue that an ambiguous query should be seen as representing not one, but multiple information needs. Based upon this premise, we propose xQuAD (Explicit Query Aspect Diversification), a novel probabilistic framework for search result diversification. In particular, the xQuAD framework naturally models several dimensions of the search result diversification problem in a principled yet practical manner. To this end, the framework represents the possible information needs underlying a query as a set of keyword-based sub-queries. Moreover, xQuAD accounts for the overall coverage of each retrieved document with respect to the identified sub-queries, so as to rank highly diverse documents first. In addition, it accounts for how well each sub-query is covered by the other retrieved documents, so as to promote novelty, and hence penalise redundancy, in the ranking. The framework also models the importance of each of the identified sub-queries, so as to appropriately cater for the interests of the user population when diversifying the retrieved documents. Finally, since not all queries are equally ambiguous, the xQuAD framework caters for the ambiguity level of different queries, so as to appropriately trade off relevance for diversity on a per-query basis. The xQuAD framework is general and can be used to instantiate several diversification models, including the most prominent models described in the literature.
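The four dimensions described above (sub-query coverage, novelty, sub-query importance, and a per-query trade-off between relevance and diversity) can be illustrated with a small greedy re-ranker. This is only a minimal sketch: the abstract does not state a scoring formula, so the code assumes the greedy objective commonly given for xQuAD in the published literature, namely (1 - lam) * P(d|q) + lam * sum over sub-queries s of P(s|q) * P(d|q,s) * prod over already selected d_j of (1 - P(d_j|q,s)). The function name `xquad_rerank`, its parameters, and the toy "jaguar" data are all hypothetical.

```python
from typing import Dict, List


def xquad_rerank(
    relevance: Dict[str, float],             # assumed P(d|q) for each document
    importance: Dict[str, float],            # assumed P(s|q) for each sub-query
    coverage: Dict[str, Dict[str, float]],   # assumed P(d|q,s), keyed by sub-query then document
    lam: float = 0.5,                        # trade-off: 0 = relevance only, 1 = diversity only
    k: int = 10,
) -> List[str]:
    """Greedily select up to k documents, trading relevance off against diversity."""
    selected: List[str] = []
    candidates = set(relevance)
    # not_covered[s]: probability that no selected document satisfies sub-query s.
    not_covered = {s: 1.0 for s in importance}

    while candidates and len(selected) < k:

        def score(doc: str) -> float:
            # Coverage of each sub-query, weighted by its importance and by
            # how poorly the already selected documents cover it (novelty).
            diversity = sum(
                importance[s] * coverage[s].get(doc, 0.0) * not_covered[s]
                for s in importance
            )
            return (1.0 - lam) * relevance[doc] + lam * diversity

        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
        for s in importance:
            not_covered[s] *= 1.0 - coverage[s].get(best, 0.0)

    return selected


# Toy data for the ambiguous query "jaguar" (illustrative values only).
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.6}
importance = {"jaguar car": 0.6, "jaguar animal": 0.4}
coverage = {
    "jaguar car": {"d1": 0.9, "d2": 0.9},
    "jaguar animal": {"d3": 0.9},
}

diversified = xquad_rerank(relevance, importance, coverage, lam=0.5, k=3)
relevance_only = xquad_rerank(relevance, importance, coverage, lam=0.0, k=3)
print(diversified)
print(relevance_only)
```

With `lam=0.5`, the only document covering the "animal" interpretation is promoted above the second "car" document, because the first selected document already covers the "car" sub-query well. With `lam=0.0`, the ranking falls back to relevance order.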
In particular, within xQuAD, each of the aforementioned dimensions of the search result diversification problem can be tackled in a variety of ways. In this thesis, as additional contributions besides the xQuAD framework, we introduce novel machine learning approaches for addressing each of these dimensions. These include a learning-to-rank approach for identifying effective sub-queries as query suggestions mined from a query log, an intent-aware approach for choosing the ranking models most likely to be effective for estimating the coverage and novelty of multiple documents with respect to a sub-query, and a selective approach for automatically predicting how much to diversify the documents retrieved for each individual query. In addition, we perform the first empirical analysis of the role of novelty as a diversification strategy for web search. As demonstrated throughout this thesis, the principles underlying the xQuAD framework are general, sound, and effective. In particular, to validate the contributions of this thesis, we thoroughly assess the effectiveness of xQuAD under the standard experimentation paradigm provided by the diversity task of the TREC 2009, 2010, and 2011 Web tracks. The results of this investigation demonstrate the effectiveness of our proposed framework. Indeed, xQuAD attains consistent and significant improvements in comparison to the most effective diversification approaches in the literature, across a range of experimental conditions comprising multiple input rankings, multiple sub-query generation and coverage estimation mechanisms, and queries with multiple levels of ambiguity. Altogether, these results corroborate the state-of-the-art diversification performance of xQuAD. Available online at http://theses.gla.ac.uk/4106/.


[287]  Stefano Mizzaro Relevance: the whole history , 1997 .

[288]  Craig MacDonald,et al.  Learning to rank query suggestions for adhoc and diversity search , 2012, Information Retrieval.

[289]  Thorsten Joachims,et al.  Predicting diverse subsets using structural SVMs , 2008, ICML '08.

[290]  Ophir Frieder,et al.  Automatic web query classification using labeled and unlabeled training data , 2005, SIGIR '05.

[291]  R. K. Shyamasundar,et al.  Introduction to algorithms , 1996 .

[292]  W. Bruce Croft,et al.  Search Engines - Information Retrieval in Practice , 2009 .

[293]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci..

[294]  Ellen M. Voorhees,et al.  TREC: Continuing information retrieval's tradition of experimentation , 2007, CACM.

[295]  Mark Sanderson,et al.  Do user preferences and evaluation measures line up? , 2010, SIGIR.

[296]  Craig MacDonald,et al.  University of Glasgow at TREC 2011: Experiments with Terrier in Crowdsourcing, Microblog, and Web Tracks , 2011, TREC.

[297]  Jenny Edwards,et al.  An adaptive model for optimizing performance of an incremental web crawler , 2001, WWW '01.

[298]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[299]  ChengXiang Zhai,et al.  Mining term association patterns from search logs for effective query reformulation , 2008, CIKM '08.

[300]  Gary James Jason,et al.  The Logic of Scientific Discovery , 1988 .

[301]  Stephen E. Robertson,et al.  Okapi at TREC-2 , 1993, TREC.

[302]  Djoerd Hiemstra,et al.  The Importance of Prior Probabilities for Entry Page Search , 2002, SIGIR '02.

[303]  Tefko Saracevic,et al.  Evaluation of evaluation in information retrieval , 1995, SIGIR '95.

[304]  Robert E. Schapire,et al.  The strength of weak learnability , 1990, Mach. Learn..

[305]  Craig MacDonald,et al.  University of Glasgow at TREC 2009: Experiments with Terrier , 2009, TREC.

[306]  Craig MacDonald,et al.  On the role of novelty for search result diversification , 2011, Information Retrieval.

[307]  Ji-Rong Wen,et al.  Multi-dimensional search result diversification , 2011, WSDM '11.

[308]  Mark E. J. Newman,et al.  The Structure and Function of Complex Networks , 2003, SIAM Rev..

[309]  Samir Khuller,et al.  The Budgeted Maximum Coverage Problem , 1999, Inf. Process. Lett..

[310]  Xin Li,et al.  Context sensitive stemming for web search , 2007, SIGIR.

[311]  Cyril W. Cleverdon,et al.  Aslib Cranfield research project: report on the testing and analysis of an investigation into the comparative efficiency of indexing systems , 1962 .

[312]  Stephen E. Robertson,et al.  Effective site finding using link anchor information , 2001, SIGIR '01.

[313]  W. Bruce Croft,et al.  Uncertainty in Information Retrieval Systems , 1996, Uncertainty Management in Information Systems.

[314]  David Salomon,et al.  Data Compression: The Complete Reference , 2006 .

[315]  Craig MacDonald,et al.  University of Glasgow at WebCLEF 2005: Experiments in per-field Normalisation and Language Specific Stemming , 2005, CLEF.

[316]  Sofia Stamou,et al.  Interpreting User Inactivity on Search Results , 2010, ECIR.

[317]  Craig MacDonald,et al.  University of Glasgow at TREC 2012: Experiments with Terrier in Medical Records, Microblog, and Web Tracks , 2012, TREC.

[318]  Qun Liu,et al.  HHMM-based Chinese Lexical Analyzer ICTCLAS , 2003, SIGHAN.

[319]  W. Bruce Croft,et al.  Learning to rank query reformulations , 2010, SIGIR '10.

[320]  Marc Najork,et al.  A large‐scale study of the evolution of Web pages , 2004, Softw. Pract. Exp..

[321]  Nivio Ziviani,et al.  Discovering Search Engine Related Queries Using Association Rules , 2003, J. Web Eng..

[322]  Craig MacDonald,et al.  Modelling efficient novelty-based search result diversification in metric spaces , 2013, J. Discrete Algorithms.

[323]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[324]  Fred J. Damerau,et al.  An experiment in automatic indexing , 1965 .

[325]  Berkant Barla Cambazoglu,et al.  Early exit optimizations for additive machine learned ranking systems , 2010, WSDM '10.

[326]  Stephen P. Harter,et al.  A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing , 1975, J. Am. Soc. Inf. Sci..

[327]  Jon M. Kleinberg,et al.  The Web as a Graph: Measurements, Models, and Methods , 1999, COCOON.

[328]  Susan T. Dumais,et al.  The web changes everything: understanding the dynamics of web content , 2009, WSDM '09.

[329]  E. F. CODD,et al.  A relational model of data for large shared data banks , 1970, CACM.

[330]  Francesco Bonchi,et al.  Query suggestions using query-flow graphs , 2009, WSCD '09.

[331]  Craig MacDonald,et al.  Exploiting query reformulations for web search result diversification , 2010, WWW '10.

[332]  Scott Sanner,et al.  Diverse retrieval via greedy optimization of expected 1-call@k in a latent subtopic relevance model , 2011, CIKM '11.

[333]  W. Bruce Croft,et al.  Query performance prediction in web search environments , 2007, SIGIR.