Improving Retrieval Performance with Positive and Negative Equivalence Classes of Terms

One of the most pressing problems facing application developers in information retrieval (IR) is the lack of sound mathematical and theoretical frameworks for understanding IR [SIGIR2000]. Although many such frameworks have been proposed, none has been sufficiently well grounded to attain widespread acceptance in the field. In addition, such frameworks are all too often not evaluated in an empirically sound way in an actual application. For this reason we have ventured into the theoretical domain of IR while grounding our work in an application of widespread importance: search and retrieval. One need only glance at the hit-count statistics of the latest search engines to realize how important search and retrieval has become.

In this paper we present a novel approach to term clustering and its application to improving the performance of search and retrieval. Our approach is firmly grounded in a theoretical framework that we have developed.

Term clustering is an approach that researchers have used to convert the original words of a document into more effective content identifiers. Term clustering algorithms generally consist of two phases: the first determines term-term similarity, and the second uses those similarities to develop clusters of terms. Latent Semantic Indexing (LSI) [Deerwester et al., 1990] is a well-known information retrieval algorithm based on Singular Value Decomposition (SVD). The values in the truncated term-term matrix produced by SVD can be treated as similarity measures for input to a clustering algorithm.

In this work we present an algorithm that produces clusters of terms that improve retrieval performance, as measured by precision and recall. We assume that the value in position (i,j) of the term-term matrix represents the similarity between term i and term j in the collection.
By extension, a negative value represents an anti-similarity between term i and term j. Our approach searches for both positive and negative clusters of terms. We show that the positive clusters, when used to expand an initial query, yield significant improvements in recall for a given collection, and that the negative clusters, when used to prune the result set, yield significant improvements in precision. To our knowledge, these are the first significant results showing that anti-similarity clusters exist and can be used to improve the performance of search and retrieval in IR.
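
To make the idea of positive and negative clusters concrete, the following is a minimal sketch and not the algorithm evaluated in this paper. It assumes that a term-term matrix T is already available (for example, from a truncated SVD), and it substitutes a simple symmetric threshold for a real clustering step. The vocabulary, the matrix values, the documents, the threshold, and the function names `signed_clusters` and `expand_and_prune` are all hypothetical and chosen only for illustration. The rule of discarding any document that contains a negative-cluster term is likewise an assumption.

```python
# Hypothetical vocabulary and hand-picked similarity values (not from the paper).
TERMS = ["car", "automobile", "flower", "engine"]
T = [
    [1.0, 0.8, -0.5, 0.6],   # car
    [0.8, 1.0, -0.3, 0.4],   # automobile
    [-0.5, -0.3, 1.0, -0.1], # flower
    [0.6, 0.4, -0.1, 1.0],   # engine
]
# Hypothetical collection: each document is reduced to its set of terms.
DOCS = {
    "d1": {"car", "engine"},
    "d2": {"automobile"},
    "d3": {"car", "flower"},
    "d4": {"flower"},
}


def signed_clusters(T, i, threshold):
    """Return (positive, negative) index sets for term i.

    A term j is 'positive' if T[i][j] > threshold and 'negative' if
    T[i][j] < -threshold. Thresholding is an assumed stand-in for clustering.
    """
    positive = {j for j, v in enumerate(T[i]) if j != i and v > threshold}
    negative = {j for j, v in enumerate(T[i]) if j != i and v < -threshold}
    return positive, negative


def expand_and_prune(T, terms, docs, query, threshold=0.2):
    """Expand the query with positive terms, then prune hits on negative terms."""
    index = {t: i for i, t in enumerate(terms)}
    positive, negative = set(), set()
    for q in query:
        pos, neg = signed_clusters(T, index[q], threshold)
        positive |= {terms[j] for j in pos}
        negative |= {terms[j] for j in neg}
    expanded = set(query) | positive
    negative -= expanded  # never prune on a term the expanded query asks for
    # Expansion step: any document sharing a term with the expanded query is a hit.
    hits = {d for d, words in docs.items() if words & expanded}
    # Pruning step: drop hits that contain any negative-cluster term.
    kept = {d for d in hits if not (docs[d] & negative)}
    return expanded, negative, hits, kept


expanded, negative, hits, kept = expand_and_prune(T, TERMS, DOCS, {"car"})
print(sorted(expanded))  # ['automobile', 'car', 'engine']
print(sorted(negative))  # ['flower']
print(sorted(hits))      # ['d1', 'd2', 'd3']
print(sorted(kept))      # ['d1', 'd2']
```

In this toy collection, expansion retrieves d2, which the original query "car" alone would have missed (the recall effect), while pruning removes d3 because it contains the anti-similar term "flower" (the precision effect).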