Paper information - Term-ordered query evaluation versus document-ordered query evaluation for large document databases

Term-ordered query evaluation versus document-ordered query evaluation for large document databases

There are two main families of technique for efficient processing of ranked queries on large text collections: document-ordered processing and term-ordered processing. In this note we compare these techniques experimentally. We show that they have similar costs for short queries, but that for long queries document-ordered processing is much more costly. Overall, we conclude that term-ordered processing, with the refinements of limited accumulators and hierarchical index structuring, is the more efficient mechanism.
Techniques for evaluation of ranked queries on large text collections are well developed. In a typical ranked system each document in the collection is heuristically assigned a score representing its similarity to the query, and the documents with the highest scores are returned to the user. The most efficient of the current systems are based on inverted files; query evaluation involves fetching of inverted files, processing them to determine similarity values, then fetching of the top-scoring documents. Typically the number of documents fetched is small, whereas a high proportion of documents in the collection will have a non-zero similarity. These evaluation techniques are used in many applications, ranging from the short queries posed to Internet search engines, typically of two to five words, to extended queries posed by searching experts and long queries generated by techniques such as query expansion and relevance feedback. Query expansion and relevance feedback can provide better effectiveness than straightforward ranking, but involve many more query terms and thus lead to increases in query evaluation costs.
There are two principal techniques for evaluation of ranked queries: term-ordered (TO) processing and document-ordered (DO) processing. Both are based on inverted files [2, 7], a data structure containing, for each term, a sorted inverted list of the identifiers of the documents in which the term appears and the frequency of the term in each document.
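The inverted file described above can be pictured with a small sketch. The function name `build_inverted_index`, the whitespace tokenisation, and the three toy documents are assumptions made for illustration; they are not taken from the paper, and real systems store the lists compressed on disk.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, in-document frequency) pairs.

    `docs` maps integer document identifiers to text. Tokenisation by
    whitespace is an assumption made to keep the sketch short.
    """
    index = defaultdict(list)
    # Visiting documents in increasing identifier order keeps every
    # inverted list sorted by document identifier.
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)


# Hypothetical toy collection.
docs = {
    1: "the cat sat on the mat",
    2: "the dog sat",
    3: "cat and dog and cat",
}
index = build_inverted_index(docs)
print(index["cat"])
print(index["the"])
```

Each list holds exactly the two items the text names: the document identifier and the frequency of the term in that document.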
SIGIR’98, Melbourne, Australia. © 1998 ACM 1-58113-015-5 8/98 $5.00.
We compare TO and DO processing experimentally. In TO processing, the inverted list of each term is processed in full before the next is considered. For each document d in which the term appears, a partial similarity value is computed from the inverted list. Each partial similarity value is added to an accumulator corresponding to d. When processing of the inverted lists is complete, the accumulators are sequentially processed to normalise them with regard to document length and to identify the highest normalised scores. This style of processing is used, for example, in SMART [4] and MG [6].
A variant on TO processing is to limit the number of accumulators, to, say, 2% of the total number of documents, and to structure the lists hierarchically [2]. In this TO’ or “skipping” style of processing, rare terms are considered first, and are free to add accumulators, up to the limit, as new document identifiers are observed. When the accumulator limit is reached no further accumulators can be added, and only a fraction of the information in the subsequent inverted lists is used; the hierarchical structuring allows the unused information to be skipped, significantly reducing CPU time (for list and accumulator processing) and memory requirements (for accumulators).
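The term-ordered procedure, with an optional accumulator limit, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the function name `term_ordered`, the weighting formula, the use of word counts as document lengths, and the toy index are all hypothetical, and the sketch scans every list in full rather than skipping through a hierarchical structure.

```python
import heapq
import math


def term_ordered(index, query_terms, doc_lengths, k=10, max_accumulators=None):
    """Rank documents by processing one inverted list at a time."""
    n_docs = len(doc_lengths)
    accumulators = {}
    # Rare terms (shortest lists) first; ties broken by term for determinism.
    terms = sorted({t for t in query_terms if t in index},
                   key=lambda t: (len(index[t]), t))
    for term in terms:
        postings = index[term]
        # Assumed weighting: the source does not give a similarity formula.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, freq in postings:
            partial = (1 + math.log(freq)) * idf
            if doc_id in accumulators:
                accumulators[doc_id] += partial
            elif max_accumulators is None or len(accumulators) < max_accumulators:
                accumulators[doc_id] = partial
            # Otherwise the limit is reached and this entry is ignored.
    # Normalise by document length and keep the k highest scores.
    scored = ((score / doc_lengths[d], d) for d, score in accumulators.items())
    return [(d, s) for s, d in heapq.nlargest(k, scored)]


# Hypothetical toy index: term -> [(doc_id, in-document frequency), ...]
index = {
    "cat": [(1, 1), (3, 2)],
    "dog": [(2, 1), (3, 1)],
    "sat": [(1, 1), (2, 1)],
    "the": [(1, 2), (2, 1)],
}
doc_lengths = {1: 6.0, 2: 3.0, 3: 5.0}  # assumed: number of words

unlimited = term_ordered(index, ["cat", "dog"], doc_lengths)
limited = term_ordered(index, ["cat", "dog"], doc_lengths, max_accumulators=2)
print([d for d, _ in unlimited])
print([d for d, _ in limited])
```

In the limited run, document 2 never receives an accumulator because both slots are taken while the first list is processed, which is a small instance of the effectiveness loss that a tight limit can cause.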
Reducing the limit on the number of accumulators simultaneously reduces both memory requirements and processing time, but also reduces the ability of the mechanism to identify relevant documents, that is, reduces its effectiveness.
Another variant of TO processing is to reorder lists by in-document frequency, so that larger partial similarities are at the front of each inverted list. We do not experiment with frequency-sorting here (it is incompatible with the broader aims of our research into passage ranking), but it allows significant gains over TO’ processing [3].
In DO processing, the inverted lists for all the query terms are processed simultaneously, in document order. At each stage the least document identifier d in any list is found, all information about d is consumed from the front of all lists in which d is referenced, a similarity value is computed for d, and processing proceeds to the next least document. Only a small number of intermediate results (final similarity values) are required. Thus DO processing has the advantage of not requiring memory space for accumulators, but has several potential disadvantages. First, either enough buffer space must be allocated to hold all inverted lists simultaneously or query evaluation times will rise because several disk accesses are required to fetch each inverted list; in contrast, with TO processing it is feasible to fetch the whole of all but the longest lists, because lists are fetched in turn. Second, as query length increases the cost of identifying the list with the least document identifier will gradually dominate, as this cost is O(n log n) in the number of query terms, while all other costs are asymptotically constant or linear. Third, with DO processing it is not possible to use optimisations such as skipping. Turtle and Flood’s analysis of the performance of ranking algorithms in limited memory suggests that DO is more efficient than TO [5].
However, the model of processing used in this analysis is based on simplifying assumptions that are not valid; in particular, these assumptions imply that the processing costs are linear in the volume of inverted index information required (which is false for DO)
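For comparison, the document-ordered procedure described earlier can be sketched as a multi-way merge. This is an illustrative sketch only: the function name `document_ordered`, the binary heap used to find the least document identifier, the weighting formula, and the toy index are assumptions rather than details from the paper.

```python
import heapq
import math


def document_ordered(index, query_terms, doc_lengths, k=10):
    """Rank documents by merging all query-term lists in document order."""
    n_docs = len(doc_lengths)
    lists = [index[t] for t in sorted(set(query_terms)) if t in index]
    # Assumed weighting: the source does not give a similarity formula.
    idfs = [math.log(1 + n_docs / len(p)) for p in lists]
    # One heap entry per list: (next doc_id, list number, position in list).
    heap = [(p[0][0], i, 0) for i, p in enumerate(lists)]
    heapq.heapify(heap)
    top = []  # min-heap holding at most k (score, doc_id) pairs
    while heap:
        doc_id = heap[0][0]  # least document identifier in any list
        score = 0.0
        # Consume doc_id from the front of every list that references it.
        while heap and heap[0][0] == doc_id:
            _, i, pos = heapq.heappop(heap)
            freq = lists[i][pos][1]
            score += (1 + math.log(freq)) * idfs[i]
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1][0], i, pos + 1))
        # The score for doc_id is final, so no accumulator is kept for it.
        entry = (score / doc_lengths[doc_id], doc_id)
        if len(top) < k:
            heapq.heappush(top, entry)
        elif entry > top[0]:
            heapq.heapreplace(top, entry)
    return [(d, s) for s, d in sorted(top, reverse=True)]


# Hypothetical toy index: term -> [(doc_id, in-document frequency), ...]
index = {
    "cat": [(1, 1), (3, 2)],
    "dog": [(2, 1), (3, 1)],
}
doc_lengths = {1: 6.0, 2: 3.0, 3: 5.0}  # assumed: number of words

ranking = document_ordered(index, ["cat", "dog"], doc_lengths)
top_two = document_ordered(index, ["cat", "dog"], doc_lengths, k=2)
print([d for d, _ in ranking])
print([d for d, _ in top_two])
```

The only per-document state is the small `top` heap, which mirrors the point that DO needs no accumulators; the price is that every list must be open at once and that each step involves work on a heap whose size grows with the number of query terms.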

Justin Zobel | Marcin Kaszkiel

[1] Howard R. Turtle, et al. Query Evaluation: Strategies and Optimizations, 1995, Inf. Process. Manag.

[2] Ian H. Witten, et al. Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images, 1999.

[3] Gerard Salton, et al. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, 1989.

[4] Alistair Moffat, et al. Self-indexing inverted files for fast text retrieval, 1996, TOIS.

[5] Ian H. Witten, et al. Managing Gigabytes: Compressing and Indexing Documents and Images, 1999.

[6] Donna K. Harman, et al. Overview of the Second Text REtrieval Conference (TREC-2), 1994, HLT.

[7] Ron Sacks-Davis, et al. Filtered document retrieval with frequency-sorted indexes, 1996.

[8] Kotagiri Ramamohanarao, et al. Inverted files versus signature files for text indexing, 1998, TODS.