Uncertainty-Based Noise Reduction and Term Selection in Text Categorization

This paper introduces a new criterium for term selection, which is based on the notion of Uncertainty. Term selection according to this criterium is performed by the elimination of noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-based term selection (UC) is compared to a number of other criteria like Information Gain (IG), simplified ?2 (SX), Term Frequency (TF) and Document Frequency (DF) in a Text Categorization setting. Experiments on data sets with different properties (Reuters- 21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC-based term selection is not the most aggressive term selection criterium, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general "install-and-forget" term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, first applying UC to eliminate noisy terms and then using another criterium to select the best terms.