Index ordering by query-independent measures

Conventional approaches to information retrieval search through all applicable entries in an inverted file for a collection in order to find the documents with the highest scores. For particularly large collections this can be extremely time-consuming. One solution is to search only a limited portion of the collection at query time, which speeds up retrieval while keeping the loss in retrieval effectiveness (in terms of accuracy of results) small. We achieve this by first identifying the most "important" documents within the collection using query-independent measures, and then sorting the documents within each inverted file list in order of this "importance". At query time only the leading part of each list is examined, so documents of lesser importance are never processed; this makes the search more efficient while limiting the loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, show significant savings in the number of postings examined, without significant loss of effectiveness, when the ordering is based on several measures of importance used in isolation and in combination. Our results point to several ways in which the computational cost of searching large collections of documents can be significantly reduced.
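The idea of importance-sorted posting lists with truncated query-time processing can be illustrated with a minimal Python sketch. It is not the system evaluated in the experiments: the function names (`build_sorted_index`, `search`), the `max_postings` cut-off, the toy documents, the importance values, and the plain term-frequency scoring are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_sorted_index(docs, importance):
    """Build an inverted index whose posting lists are sorted by
    descending query-independent importance (hypothetical scores)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    for postings in index.values():
        # Most important documents come first in every list.
        postings.sort(key=lambda p: importance[p[0]], reverse=True)
    return dict(index)


def search(index, query, max_postings, top_k=10):
    """Score documents using only the first `max_postings` entries of
    each query term's list. Scoring is a plain term-frequency sum,
    chosen for brevity (an assumption, not the paper's ranking function)."""
    scores = defaultdict(float)
    examined = 0
    for term in query.lower().split():
        for doc_id, tf in index.get(term, [])[:max_postings]:
            scores[doc_id] += tf
            examined += 1
    # Rank by score, breaking ties by document id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k], examined


# Toy collection and made-up importance values (illustrative only).
docs = {
    "d1": "terabyte search engine",
    "d2": "search search index",
    "d3": "index pruning search",
    "d4": "search log",
}
importance = {"d1": 0.9, "d2": 0.1, "d3": 0.5, "d4": 0.3}

index = build_sorted_index(docs, importance)
full_results, full_cost = search(index, "search", max_postings=4)
cut_results, cut_cost = search(index, "search", max_postings=2)
print(full_cost, full_results)
print(cut_cost, cut_results)
```

In this toy run the truncated search examines half as many postings but misses `d2`, which has the highest term frequency and the lowest importance. This is the efficiency versus effectiveness trade-off described above, and it depends heavily on how well the chosen importance measure correlates with relevance.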
