The importance of text mining stems from the availability of large text databases that hold a wealth of valuable information waiting to be mined. Text categorization is the process of assigning categories or labels to documents based entirely on their contents. Formally, it can be viewed as a mapping from the document space into a set of predefined class labels (also known as subjects or categories): F: D → {C1, C2, ..., Cn}, where F is the mapping function, D is the document space, and {C1, C2, ..., Cn} is the set of class labels. Given an unlabeled document d, we need to find its class label Ci using the mapping function F, where F(d) = Ci. In this paper, we propose an optimized k-Nearest Neighbors (KNN) classifier that uses intervalization and the P-tree technology to achieve a high degree of accuracy, space utilization, and time efficiency. As new samples arrive, the classifier finds the k nearest neighbors of each new sample in the training space without a single database scan.
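To make the mapping F(d) = Ci concrete, the following is a minimal sketch of a plain k-nearest-neighbors text classifier. It is not the P-tree method proposed here: it scans the full training set for every query, which is exactly the cost the proposed classifier is designed to avoid, and it omits intervalization entirely. The term-frequency vectors, the cosine similarity measure, the toy training set, and the names `vectorize`, `cosine`, and `knn_classify` are all assumptions made for illustration.

```python
from collections import Counter
import math


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(doc, training, k=3):
    """Approximate F(d): majority label among the k most similar training docs.

    Assumption: this baseline scans every training document per query.
    """
    query = vectorize(doc)
    ranked = sorted(
        training,
        key=lambda pair: cosine(query, vectorize(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training space: (document, class label) pairs.
training = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the match ended in a draw", "sports"),
    ("stocks fell as the market closed", "finance"),
    ("the bank raised interest rates", "finance"),
    ("investors sold shares in the market", "finance"),
]

if __name__ == "__main__":
    print(knn_classify("the goalkeeper saved the match", training, k=3))
    print(knn_classify("the market fell and investors sold stocks", training, k=3))
```

In this baseline, each call to `knn_classify` costs one pass over the training set, so query time grows linearly with the number of training documents.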