Integrating Clustering and Supervised Learning for Categorical Data Analysis

Fuzzy clustering of categorical data, where no natural ordering exists among the elements of a categorical attribute domain, is an important problem in exploratory data analysis. As a result, several clustering algorithms that focus on categorical data have been proposed. In this paper, a modified differential evolution (DE)-based fuzzy c-medoids (FCMdd) clustering algorithm for categorical data is proposed. The algorithm combines local and global information with adaptive weighting. Its performance is compared with that of clustering based on a genetic algorithm, simulated annealing, and the classical DE technique, as well as with the FCMdd, fuzzy k-modes, and average linkage hierarchical clustering algorithms, on four artificial and four real-life categorical data sets. A statistical test has been carried out to establish the statistical significance of the results obtained by the proposed method. To improve the results further, the clustering method is integrated with a support vector machine (SVM), a well-known technique for supervised learning. A fraction of the data points, selected from the different clusters on the basis of their proximity to the respective medoids, is used to train the SVM. The cluster assignments of the remaining points are then determined using the trained classifier. The superiority of the integrated clustering and supervised learning approach is demonstrated.
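
The proximity-based selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the medoids are already available (the DE-based search is not shown), uses the Hamming distance as the categorical dissimilarity, and applies the standard FCMdd membership formula with fuzzifier `m`. The function names and the `fraction` parameter are hypothetical and chosen only for illustration. The selected points and their cluster labels would then be passed to any SVM library for training, and the trained classifier would relabel the remaining points.

```python
import math
from typing import List, Sequence, Tuple

Record = Sequence[str]


def hamming(a: Record, b: Record) -> int:
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_memberships(data: List[Record], medoids: List[Record],
                      m: float = 2.0) -> List[List[float]]:
    """FCMdd-style memberships: u_ik is proportional to (1/d_ik)^(1/(m-1))."""
    memberships = []
    for x in data:
        d = [hamming(x, med) for med in medoids]
        if 0 in d:
            # The point coincides with a medoid: share the full membership
            # equally among the zero-distance medoids.
            w = [1.0 if di == 0 else 0.0 for di in d]
        else:
            w = [(1.0 / di) ** (1.0 / (m - 1.0)) for di in d]
        total = sum(w)
        memberships.append([wi / total for wi in w])
    return memberships


def split_by_proximity(data: List[Record], medoids: List[Record],
                       fraction: float = 0.5,
                       m: float = 2.0) -> Tuple[List[int], List[int], List[int]]:
    """Return (labels, train_idx, rest_idx).

    Each point is assigned to its highest-membership cluster. Within every
    cluster, the `fraction` of points closest to the medoid is kept for
    training (assumption: rounded up, so a non-empty cluster contributes
    at least one point); the remaining points are left for the classifier.
    """
    U = fuzzy_memberships(data, medoids, m)
    labels = [max(range(len(medoids)), key=lambda k: u[k]) for u in U]
    train_idx, rest_idx = [], []
    for k, med in enumerate(medoids):
        members = [i for i, lab in enumerate(labels) if lab == k]
        members.sort(key=lambda i: hamming(data[i], med))
        n_train = math.ceil(fraction * len(members))
        train_idx.extend(members[:n_train])
        rest_idx.extend(members[n_train:])
    return labels, sorted(train_idx), sorted(rest_idx)
```

To train a kernel SVM on the selected points, the categorical attributes would typically be converted to a numeric form first, for example by one-hot encoding; that choice is also an assumption and is not taken from the abstract.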
