Impact-Based Document Retrieval

Two of the most important factors in the success of any document retrieval system are its query mechanism and the representation of its auxiliary operational data. The former strongly affects both the quality of the retrieval results and the speed of the system. The latter determines how compactly the system can store its operational data, which in turn reduces storage space and improves query processing speed. To address these two aspects, conventional document retrieval systems rely on a range of stored statistics, such as term frequencies, document frequencies, and document lengths. These statistics are typically stored in an index data structure, using compression techniques that are bit- or byte-oriented. During query processing, the relevant statistics are decompressed and then combined in a precise and often complicated quantitative formula that determines the final retrieval results. This thesis presents a novel approach, impact-based retrieval, that has the twin benefits of enhancing both the effectiveness and the efficiency of retrieval. Several variants of impact-based retrieval are described. The most advanced variant does not require any precise similarity formulation, because the impacts are defined qualitatively rather than quantitatively. It achieves better retrieval effectiveness than several state-of-the-art retrieval techniques. Moreover, it uses only a few distinct impact values, allowing a more compact organisation of the indexes and faster query processing. To further accelerate query evaluation, several word-oriented index compression schemes have been developed, offering good compression effectiveness and fast decompression. These schemes take the best features of the bit- and byte-oriented methods and represent an excellent compromise between them.
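As a rough illustration of why a small set of integer impacts can simplify query processing, the following minimal sketch scores documents by summing precomputed impacts, with no similarity formula evaluated at query time. It is not the thesis's actual method: the index layout, the impact values, and the names `INDEX` and `rank` are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical impact-ordered index (an assumption for illustration):
# each term maps to blocks of (integer impact, document ids), highest impact first.
INDEX = {
    "impact": [(3, [2, 7]), (1, [1, 4])],
    "retrieval": [(2, [2, 4]), (1, [7, 9])],
}


def rank(query_terms, index, k=3):
    """Return the top-k (doc_id, score) pairs, where a document's score is
    the sum of the integer impacts of the query terms it contains."""
    scores = defaultdict(int)
    for term in query_terms:
        # Terms absent from the index contribute nothing.
        for impact, doc_ids in index.get(term, []):
            for doc in doc_ids:
                scores[doc] += impact
    # Sort by descending score; break ties by ascending document id.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:k]


if __name__ == "__main__":
    print(rank(["impact", "retrieval"], INDEX))
```

Because every posting carries one of only a few integer values, the per-posting work in this sketch is a single integer addition, and documents sharing an impact can be grouped into one block, which is the kind of compact organisation the abstract alludes to.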
