An Anti-noise Text Categorization Method Based on Support Vector Machines

Text categorization has become one of the key techniques for handling and organizing web data. Although in theory the native properties of SVM (Support Vector Machines) are better suited to text categorization than those of Naive Bayes, in practice the classification precision of SVM is lower than that of the Bayesian method. This paper tries to explain the discrepancy by analyzing the shortcomings of SVM, and presents an anti-noise SVM method. The improved method has two characteristics: 1) it chooses the optimal n-dimensional classification hyperspace; 2) it separates out noise samples in a preprocessing step and trains the classifier on noise-free samples. Compared with the Naive Bayes method, the classification precision of the anti-noise SVM is about 3 to 9 percent higher.
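The second characteristic (filter noise first, then train on the remaining samples) can be illustrated with a minimal sketch. The abstract does not say how noise samples are identified, so the nearest-neighbour vote used here is an assumed stand-in, not the paper's procedure. The function names `filter_noise`, `train_linear_svm` and `predict` are hypothetical, and the linear SVM is a plain subgradient-descent version without a bias term, which assumes roughly centred data.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def filter_noise(X, y, k=3):
    """Assumed noise filter: keep a sample only if its label (+1 or -1)
    agrees with the majority label of its k nearest neighbours."""
    keep = []
    for i, xi in enumerate(X):
        # Squared Euclidean distance from sample i to every other sample.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X) if j != i
        )
        votes = sum(y[j] for _, j in dists[:k])
        # A tied vote (possible when k is even) also drops the sample.
        if votes * y[i] > 0:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Subgradient descent on the L2-regularised hinge loss.
    No bias term: a simplification made for this sketch."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * dot(w, xi)
            # Shrink step from the regulariser.
            w = [(1 - eta * lam) * wj for wj in w]
            # Hinge-loss step only for samples inside the margin.
            if margin < 1:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

# Toy data: two clusters plus one mislabelled point inside the positive cluster.
X = [[2, 2], [3, 2], [2, 3], [3, 3],
     [-2, -2], [-3, -2], [-2, -3], [-3, -3],
     [2.5, 2.5]]
y = [1, 1, 1, 1, -1, -1, -1, -1, -1]

X_clean, y_clean = filter_noise(X, y, k=3)   # drops the mislabelled point
w = train_linear_svm(X_clean, y_clean)       # train on the filtered set only
```

In a real text-categorization setting the vectors would be term-weight features and the filter would have to cope with much higher dimensionality; the toy data only shows the order of the two stages.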
