Reduced Support Vector Machine Based on k-Mode Clustering for Classification Large Categorical Dataset

The smooth support vector machine (SSVM) is a promising algorithm for classification problems, but it works well only on small to moderate datasets: using SSVM with a nonlinear kernel on a large dataset runs into computational difficulties. The reduced support vector machine (RSVM), which builds on SSVM, was proposed to overcome these difficulties by using a randomly selected subset of the data to obtain a nonlinear separating surface. In this paper, we propose an alternative algorithm, k-mode RSVM (KMO-RSVM), which combines RSVM with the k-mode clustering technique to handle classification problems on large categorical datasets. In our experiments, we tested the effectiveness of KMO-RSVM on four publicly available datasets. KMO-RSVM runs significantly faster than SSVM while still achieving high accuracy. Compared with RSVM, KMO-RSVM is faster, produces a smaller reduced set, and achieves comparable testing accuracy.
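As a rough illustration of the idea of replacing RSVM's random subset with cluster representatives, the sketch below runs a plain k-modes clustering (simple matching dissimilarity, per-attribute modes) within each class and returns the resulting modes as a candidate reduced set. The abstract does not specify how KMO-RSVM actually builds its reduced set, so the per-class clustering, the function names (`hamming`, `column_modes`, `k_modes`, `reduced_set`), and the parameter `k_per_class` are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter
import random


def hamming(a, b):
    """Simple matching dissimilarity: count of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category of each attribute (column) across rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, k, max_iter=20, seed=0):
    """Plain k-modes on tuples of categorical values; returns k modes."""
    rng = random.Random(seed)
    # Start from k distinct observed rows (assumes k <= number of distinct rows).
    modes = rng.sample(sorted(set(rows)), k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for r in rows:
            # Assign each row to its nearest mode.
            nearest = min(range(k), key=lambda i: hamming(r, modes[i]))
            clusters[nearest].append(r)
        # Recompute modes; an empty cluster keeps its previous mode.
        new_modes = [column_modes(c) if c else modes[i]
                     for i, c in enumerate(clusters)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes


def reduced_set(rows, labels, k_per_class, seed=0):
    """Hypothetical reduced-set builder: cluster each class separately
    and collect the cluster modes (an assumption, not the paper's recipe)."""
    reduced = []
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        k = min(k_per_class, len(set(members)))
        reduced.extend(k_modes(members, k, seed=seed))
    return reduced


if __name__ == "__main__":
    rows = [("a", "x"), ("a", "x"), ("a", "y"),
            ("b", "z"), ("b", "z"), ("c", "z")]
    labels = [1, 1, 1, -1, -1, -1]
    print(reduced_set(rows, labels, k_per_class=1))
```

In an RSVM-style classifier, the returned modes would presumably take the place of the randomly sampled rows when forming the reduced kernel matrix; how the categorical values are encoded for the kernel is left open here.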
