An analysis of approximate versus exact discrimination values

Abstract

Term discrimination values have been used to characterize and select potential index terms for use during automatic indexing. Two basic approaches to the calculation of discrimination values have been suggested. They differ in how they calculate space density: one method uses the average document-pair similarity for the collection; the other constructs an artificial “average” document, the centroid, and computes the sum of the similarities of each document with the centroid. The former method has been said to produce “exact” discrimination values and the latter “approximate” values. This article investigates the differences between the algorithms associated with these two approaches, as well as several modified versions of the algorithms, in terms of their impact on the discrimination value model. It does so by determining the differences between the rankings of the exact and approximate discrimination values. The experimental results show that the rankings produced by the exact approach and by a centroid-based algorithm suggested by the author are highly compatible. These results indicate that a previously suggested method involving the calculation of exact discrimination values cannot be recommended, given its excessive cost; the approximate (i.e., “exact centroid”) approach discussed in this article yields a comparable result at a cost that makes its use feasible for any of the experimental document collections currently in use.
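The two space-density calculations contrasted above can be illustrated with a minimal sketch. It is not the algorithm evaluated in the article. It assumes cosine similarity as the similarity measure, assumes the conventional definition of a term's discrimination value as the density with that term removed minus the density with all terms present, and uses a small made-up document-term matrix. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two term-weight vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def density_pairwise(docs):
    """'Exact' density: average similarity over all document pairs."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def density_centroid(docs):
    """'Approximate' density: sum of each document's similarity to the centroid."""
    n = len(docs)
    centroid = [sum(column) / n for column in zip(*docs)]
    return sum(cosine(d, centroid) for d in docs)


def drop_term(docs, k):
    """Return the collection with term k removed from every document."""
    return [d[:k] + d[k + 1:] for d in docs]


def discrimination_values(docs, density):
    """Assumed definition: DV_k = density without term k - density with all terms."""
    base = density(docs)
    return [density(drop_term(docs, k)) - base for k in range(len(docs[0]))]


def ranking(values):
    """Term indices ordered from highest to lowest discrimination value."""
    return sorted(range(len(values)), key=lambda k: values[k], reverse=True)


# Illustrative data only: rows are documents, columns are term weights.
docs = [
    [2, 0, 1, 1],
    [0, 3, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 2],
]

exact = discrimination_values(docs, density_pairwise)
approx = discrimination_values(docs, density_centroid)
print("exact ranking: ", ranking(exact))
print("approx ranking:", ranking(approx))
```

For a collection of n documents, the pairwise density needs n(n-1)/2 similarity computations for each term, while the centroid density needs only n, which is the source of the cost difference. Because one density is an average and the other a sum, the raw values are on different scales, so the sketch compares the term rankings and not the values themselves.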