Application of k-Nearest Neighbor on Feature Projections Classifier to Text Categorization

Tuba Yavuz and H. Altay Guvenir
Department of Computer Engineering and Information Science
Bilkent University, Ankara
{ytuba, guvenir}@cs.bilkent.edu.tr

Abstract

This paper presents the results of the application of an instance-based learning algorithm, the k-Nearest Neighbor Method on Feature Projections (k-NNFP), to text categorization and compares it with the k-Nearest Neighbor Classifier (k-NN). k-NNFP is similar to k-NN, except that it finds the nearest neighbors according to each feature separately and then combines these predictions using majority voting. This property allows k-NNFP to eliminate possible adverse effects of irrelevant features on classification accuracy. Experimental evidence indicates that k-NNFP is superior to k-NN in terms of classification accuracy in the presence of irrelevant features in many real-world domains.

Introduction

As technological improvements began to support the storage of high-volume data, the feasible implementation of applications that can use such amounts of data became a topic of discussion. Information Retrieval (IR) is one such area: applications in its domain mostly require the use of large amounts of data. Moreover, the documents it deals with are mostly in natural language. Therefore, high data volume is not the only factor that affects design decisions for IR applications; the content of the documents also poses significant problems to deal with.

Text categorization, which is the process of assigning predefined categories to text documents, is one of the hot topics of IR: it requires the flexibility to handle a large volume of data efficiently and to process and understand the content of that data to a degree that gives meaningful results.

Many machine learning algorithms have been applied to text categorization so far. These include symbolic and statistical approaches, and experiments regarding these works give promising results. However, most of the algorithms do not scale with the size of the feature set, which is on the order of tens of thousands. This requires a reduction of the feature set or the training set in such a way that accuracy does not degrade. On the other hand, algorithms like k-NN and the Linear Least Squares Fit (LLSF) mapping method can be used with large feature sets compared to the other existing methods.

This paper examines the performance of a new version of the nearest neighbor algorithm, called k-NNFP, when applied to text categorization. The k-NNFP classifier is a variant of k-NN: it finds the k nearest neighbors separately for each feature, whereas the k-NN classifier finds the k nearest neighbors by considering all the features together. The experiment was done by using these classifiers
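To make the k-NNFP prediction step described above concrete, the following is a minimal sketch in Python. It assumes numeric feature vectors, an absolute-difference distance on each single feature, and arbitrary tie-breaking in the vote; the function name knnfp_predict and the NumPy-based formulation are illustrative choices, not the authors' implementation.

    import numpy as np
    from collections import Counter

    def knnfp_predict(X_train, y_train, x_query, k=3):
        """Classify x_query with k-NN on Feature Projections (k-NNFP).

        For every feature, the k training instances closest to the query
        on that feature alone each cast one vote for their own class;
        the class with the most votes across all features is returned.
        """
        votes = Counter()
        for f in range(X_train.shape[1]):
            # Distance measured on the projection to feature f only.
            dists = np.abs(X_train[:, f] - x_query[f])
            nearest = np.argsort(dists)[:k]
            votes.update(y_train[nearest])
        return votes.most_common(1)[0][0]

    # Tiny illustration on hypothetical two-feature documents:
    X = np.array([[1.0, 5.0], [1.2, 4.8], [8.0, 0.5], [7.9, 0.7]])
    y = np.array(["sports", "sports", "politics", "politics"])
    print(knnfp_predict(X, y, np.array([1.1, 4.9]), k=2))  # -> sports

In contrast, a standard k-NN classifier computes a single distance over all features together (e.g., Euclidean) and lets only the k globally nearest instances vote; an irrelevant feature can therefore distort its entire neighborhood, whereas in k-NNFP it contributes at most k of the votes.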
References

[1] Andreas S. Weigend et al., "A neural network approach to topic spotting," 1995.
[2] Yiming Yang et al., "Expert network: effective and efficient learning from human decisions in text categorization and retrieval," SIGIR '94, 1994.
[3] Yoram Singer et al., "Context-sensitive learning methods for text categorization," SIGIR '96, 1996.
[4] David L. Waltz et al., "Trading MIPS and memory for knowledge engineering," CACM, 1992.
[5] Yiming Yang et al., "An application of least squares fit mapping to text information retrieval," SIGIR, 1993.
[6] Y. Yang et al., "An evaluation of statistical approaches to MEDLINE indexing," Proceedings of the AMIA Fall Symposium, 1996.
[7] David D. Lewis et al., "A comparison of two learning algorithms for text categorization," 1994.
[8] Sholom M. Weiss et al., "Towards language independent automated learning of text categorization models," SIGIR '94, 1994.
[9] David L. Waltz et al., "Classifying news stories using memory based reasoning," SIGIR '92, 1992.
[10] James P. Callan et al., "Training algorithms for linear text classifiers," SIGIR '96, 1996.