A comparison of two weighting schemes for Boolean retrieval

Boolean retrieval logic is the basis of most operating information retrieval systems (IRSs). There are many reasons why this type of system has been so attractive. For example, it allows users to issue requests in which the topics of interest and the relations between them are clearly and precisely stated; the user has considerable flexibility in formulating his request; and the request can be reformulated, as convenient, to equivalent requests that will retrieve the identical set of documents (Bookstein and Cooper, 1976). Further, it is relatively easy to learn how to use such a system, and, given software that is widely available, such systems can be easily implemented to permit an efficient search, even of rather large files. For these reasons, a number of weaknesses inherent in these systems are often overlooked. A very serious constraint of Boolean systems is the necessity of associating a number of index terms with each document. The problem is that it is often unclear whether a given index term is appropriate for a document: both the decision to include the term and the decision to omit it might result in retrieval errors, false drops in the first case and lost relevant documents in the second. The issuer of a request is similarly constrained either to include a term in his request or to leave it out. It is not possible for a patron to include two terms while indicating that one is more important than the other. The above weaknesses have encouraged the development of alternative approaches, such as the use of vector models (Salton, 1968), which permit the patron and the indexer to differentiate index terms by weight. A user of such a system, however, cannot indicate how the terms logically relate to one another.
Others have created multi-stage systems, in which a standard Boolean retrieval process first retrieves a set of documents; these documents are then processed by an independent weighting mechanism that assigns to each retrieved document a value representing the importance of the terms by which the document is indexed (Noreault, Koll and McGill, 1977). Unfortunately, such hybrid methods are subject to inconsistencies, in that two logically equivalent requests can retrieve different sets of documents (Bookstein, 1978).
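
The contrast drawn above can be made concrete with a small sketch. It is an illustration under stated assumptions and not the mechanism analysed in the works cited: the toy index, the nested-tuple request format, the function names, and the second-stage rule (sum the weights of the request terms by which a document is indexed, then apply a threshold) are all hypothetical. The sketch shows that pure Boolean retrieval returns the identical set for logically equivalent requests, whereas a second stage that depends on how the request is written may not.

```python
# Hypothetical toy index: index term -> set of document ids.
INDEX = {
    "retrieval": {1, 2, 3},
    "boolean": {2, 3, 4},
    "weighting": {3, 5},
}


def docs(term):
    """Documents indexed by a term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


def boolean_search(request):
    """Evaluate a request given as a term or a tuple ("and"/"or", sub, ...)."""
    if isinstance(request, str):
        return set(docs(request))
    op, *args = request
    results = [boolean_search(arg) for arg in args]
    if op == "and":
        return set.intersection(*results)
    if op == "or":
        return set.union(*results)
    raise ValueError(f"unknown operator: {op}")


def request_terms(request):
    """Collect every term that appears in the written form of the request."""
    if isinstance(request, str):
        return {request}
    _, *args = request
    return set().union(*(request_terms(arg) for arg in args))


def hybrid_search(request, weights, threshold):
    """Assumed two-stage scheme: Boolean retrieval, then a weight threshold.

    A document's score is the summed weight of the request terms by which
    it is indexed; this scoring rule is an assumption for illustration.
    """
    terms = request_terms(request)
    kept = set()
    for doc in boolean_search(request):
        score = sum(weights[t] for t in terms if doc in docs(t))
        if score >= threshold:
            kept.add(doc)
    return kept


# Two requests that are logically equivalent by the absorption law:
# A  ==  A AND (A OR B)
simple = "retrieval"
padded = ("and", "retrieval", ("or", "retrieval", "weighting"))

# Pure Boolean retrieval treats them identically.
same_boolean = boolean_search(simple) == boolean_search(padded)

# The assumed second stage does not, because "weighting" contributes to
# the score only when it appears in the written request.
weights = {"retrieval": 1, "boolean": 1, "weighting": 1}
hybrid_simple = hybrid_search(simple, weights, threshold=2)
hybrid_padded = hybrid_search(padded, weights, threshold=2)
```

In this sketch the inconsistency arises because the second stage reads the surface form of the request rather than its logical content, so a term that is logically redundant still changes the score. Other two-stage designs may fail in different ways, or not at all.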