Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has become very important in the information retrieval area, where information needs have increased tremendously with the rapid growth of textual information sources such as the Internet. We compare, for text categorization, two partially supervised (or semi-supervised) clustering algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm (A. Amar et al., 1997) and the Semi-Supervised Fuzzy-c-Means (ssFCM) algorithm (M. Amine et al., 1996). The semi-supervised learning paradigm falls between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both the class information contained in labeled data (training documents) and the structure information possessed by unlabeled data (test documents) in order to produce better partitions for test documents. Our experiments use the Reuters-21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms and takes less time than the Fuzzy-c-Means (FCM) algorithm. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC. Finally, ssFCM yields improved performance and faster execution time as more weight is given to training documents.
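The feature selection step mentioned above can be illustrated with a minimal sketch. It assumes each document is represented as a set of terms and each label is 1 or 0 for a single category (matching the binary, per-category setup); the function names, the toy documents, and the alphabetical tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
import math


def entropy(pos, total):
    """Binary entropy (in bits) of a set with `pos` positives out of `total`."""
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, total - pos):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """Reduction in category entropy obtained by splitting on term presence."""
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_before = entropy(sum(labels), n)
    h_after = (
        len(with_term) / n * entropy(sum(with_term), len(with_term))
        + len(without_term) / n * entropy(sum(without_term), len(without_term))
    )
    return h_before - h_after


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain.

    Ties are broken alphabetically (an arbitrary choice made for this sketch).
    """
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: label 1 means "belongs to the category".
toy_docs = [
    {"wheat", "export"},
    {"wheat", "price"},
    {"oil", "price"},
    {"oil", "export"},
]
toy_labels = [1, 1, 0, 0]
top_terms = select_features(toy_docs, toy_labels, 2)
```

In this toy corpus, "wheat" and "oil" each separate the two classes perfectly, while "price" and "export" appear equally often in both classes and therefore carry no information about the category. Varying `k` corresponds to experimenting with different numbers of features.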
[1] James C. Bezdek et al. Pattern Recognition with Fuzzy Objective Function Algorithms. Advanced Applications in Pattern Recognition, 1981.
[2] Robert Tibshirani et al. An Introduction to the Bootstrap. 1994.
[3] Andreas S. Weigend et al. A neural network approach to topic spotting. 1995.
[4] Gerard Salton et al. Term-Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manag., 1988.
[5] David D. Lewis et al. A comparison of two learning algorithms for text categorization. 1994.
[6] Amine Bensaid et al. Semi-Supervised Hierarchical Clustering Algorithms. SCAI, 1997.
[7] Yiming Yang et al. A Comparative Study on Feature Selection in Text Categorization. ICML, 1997.
[8] James C. Bezdek et al. Partially supervised clustering for image segmentation. Pattern Recognit., 1996.
[9] Sebastian Thrun et al. Learning to Classify Text from Labeled and Unlabeled Documents. AAAI/IAAI, 1998.
[10] Sholom M. Weiss et al. Automated learning of decision rules for text categorization. TOIS, 1994.