Selective web information retrieval

This thesis proposes selective Web information retrieval, a framework formulated in terms of statistical decision theory that aims to apply an appropriate retrieval approach on a per-query basis. The main component of the framework is a decision mechanism that selects the retrieval approach for each query. The selection is based on the outcome of an experiment, which is performed before the final ranking of the retrieved documents. The experiment is a process that extracts features from a sample of the set of retrieved documents.

This thesis investigates three broad types of experiments. The first type counts the occurrences of query terms in the retrieved documents, indicating the extent to which the query topic is covered in the document collection. The second type considers information from the distribution of retrieved documents in larger aggregates of related Web documents, such as whole Web sites, or directories within Web sites. The third type estimates the usefulness of the hyperlink structure among a sample of the set of retrieved Web documents. The proposed experiments are evaluated in the context of both informational and navigational search tasks with an optimal Bayesian decision mechanism, where it is assumed that relevance information exists.

This thesis further investigates the implications of applying selective Web information retrieval in an operational setting, where the tuning of a decision mechanism is based on limited existing relevance information and the information retrieval system’s input is a stream of queries related to mixed informational and navigational search tasks. First, the experiments are evaluated using different training and testing query sets, as well as a mixture of different types of queries. Second, query sampling is introduced in order to approximate the queries that a retrieval system receives, and to tune an ad-hoc decision mechanism with a broad set of automatically sampled queries.
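
The per-query selection described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a 0-1 loss (so minimising expected loss reduces to choosing the approach with the highest posterior probability), a discretised experiment outcome, and add-one smoothing. The function names `coverage_bin`, `train` and `select`, and the approach labels `"content"` and `"links"`, are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def coverage_bin(query_terms, sampled_docs, n_bins):
    """Hypothetical experiment: the share of sampled documents that contain
    every query term, mapped to one of n_bins discrete outcomes."""
    if not sampled_docs:
        return 0
    terms = {t.lower() for t in query_terms}
    hits = sum(1 for doc in sampled_docs if terms <= set(doc.lower().split()))
    # Integer arithmetic avoids floating-point rounding at bin boundaries.
    return min(hits * n_bins // len(sampled_docs), n_bins - 1)


def train(samples, n_bins, smoothing=1.0):
    """samples: (outcome_bin, best_approach) pairs from training queries,
    where best_approach is the approach that was most effective for the
    query according to existing relevance information."""
    prior = Counter(best for _, best in samples)
    likelihood = defaultdict(Counter)
    for outcome, best in samples:
        likelihood[best][outcome] += 1
    return prior, likelihood, n_bins, smoothing


def select(model, outcome_bin):
    """Return the approach with the highest posterior given the outcome.
    Under the assumed 0-1 loss this is the Bayes decision."""
    prior, likelihood, n_bins, smoothing = model
    total = sum(prior.values())

    def posterior(approach):
        p_state = prior[approach] / total
        p_outcome = (likelihood[approach][outcome_bin] + smoothing) / (
            prior[approach] + smoothing * n_bins
        )
        return p_state * p_outcome  # proportional to P(approach | outcome)

    # Sorting makes tie-breaking deterministic (alphabetical order).
    return max(sorted(prior), key=posterior)


training = [(0, "content"), (0, "content"), (1, "content"),
            (2, "links"), (2, "links"), (1, "links")]
model = train(training, n_bins=3)
outcome = coverage_bin(["web", "search"],
                       ["web search engines", "search only", "Web SEARCH"], 3)
chosen = select(model, outcome)
```

A different loss function, or a continuous density estimate in place of the histogram, would change only the `posterior` computation; the selection step itself stays the same.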
