Execution performance issues in full-text information retrieval

The task of an information retrieval system is to identify documents that will satisfy a user's information need. Effective fulfillment of this task has long been an active area of research, leading to sophisticated retrieval models for representing information content in documents and queries and for measuring similarity between the two. The maturity and proven effectiveness of these systems have resulted in demand for increased capacity, performance, scalability, and functionality, especially as information retrieval is integrated into more traditional database management environments. In this dissertation we explore a number of functionality and performance issues in information retrieval. First, we consider creation and modification of the document collection, concentrating on management of the inverted file index. An inverted file architecture based on a persistent object store is described, and experimental results are presented for inverted file creation and modification. Our architecture provides performance that scales well with document collection size, and the database features supported by the persistent object store provide many solutions to issues that arise during integration of information retrieval into more general database environments. We then turn to query evaluation speed and introduce a new optimization technique for statistical ranking retrieval systems that support structured queries. Experimental results from a variety of query sets show that execution time can be reduced by more than 50% with no noticeable impact on retrieval effectiveness, making these more complex retrieval models attractive alternatives for environments that demand high performance.
