Feature Selection and Semi-supervised Clustering Using Multiobjective Optimization

In this paper we couple the feature selection problem with semi-supervised clustering. Semi-supervised clustering techniques are used to overcome the problems associated with purely unsupervised and purely supervised classification. In general, however, not all features present in a data set are important for clustering, so selecting an appropriate subset of features is highly relevant from the clustering point of view. A newly developed multiobjective simulated annealing based technique, archived multiobjective simulated annealing (AMOSA), is used as the underlying optimization technique. Features and cluster centers are encoded together in the form of a string. We assume that class label information is known for 10% of the data points in each data set. Four objective functions are used. The first two represent, respectively, the total symmetry present in the clusters and the total compactness of the partitioning; they are based on point symmetry and Euclidean distance computations. The third is an external cluster validity index that measures the similarity between the clustering obtained on the labeled data and the original labeling, and the fourth counts the number of features. Our objective is to optimize the values of the cluster validity indices while increasing the number of features, in order to remove the bias of internal cluster validity indices toward lower dimensions. AMOSA is used to detect the appropriate subset of features, the actual number of clusters, and the true partitioning. Data points are assigned to their respective clusters using a new point symmetry distance based methodology. Mutation changes the feature combination as well as the set of cluster centers. We also implement a novel method to select a single solution from the Pareto-optimal front.
The proposed Semi-FeaClustMOO technique is thus designed to obtain the actual number of clusters as well as the true partitioning. Its efficacy is shown on three real-life data sets, where it is compared with the genetic algorithm based VGAPS clustering technique and the K-means clustering technique. These two clustering techniques work with all available features of the data sets, whereas Semi-FeaClustMOO uses a subset of features during the computation.
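To make the encoding and the objectives more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a solution is a binary feature mask plus a list of cluster centers, and it evaluates three of the four objectives. Several simplifications are assumptions made here for brevity: points are assigned by plain Euclidean distance instead of the point symmetry based distance, the symmetry objective is omitted, and a plain Rand index stands in for the external validity index, which the abstract does not name. All function names are hypothetical.

```python
from itertools import combinations
from math import dist


def project(vector, mask):
    """Keep only the coordinates whose feature bit is 1."""
    return tuple(x for x, bit in zip(vector, mask) if bit)


def assign(points, mask, centers):
    """Nearest-center assignment in the selected feature subspace.

    ASSUMPTION: Euclidean assignment is a stand-in for the paper's
    point symmetry based assignment, which is not reproduced here.
    """
    labels = []
    for p in points:
        d = [dist(project(p, mask), project(c, mask)) for c in centers]
        labels.append(d.index(min(d)))
    return labels


def compactness(points, mask, centers, labels):
    """Total Euclidean distance of points to their assigned centers."""
    return sum(dist(project(p, mask), project(centers[k], mask))
               for p, k in zip(points, labels))


def rand_index(pred, truth):
    """Fraction of point pairs on which two labelings agree.

    ASSUMPTION: stand-in external index; needs at least two labeled points.
    """
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j])
                for i, j in pairs)
    return agree / len(pairs)


def objectives(points, mask, centers, labeled_idx, labeled_truth):
    """Return a tuple in which every entry is to be minimized."""
    labels = assign(points, mask, centers)
    comp = compactness(points, mask, centers, labels)
    ext = rand_index([labels[i] for i in labeled_idx], labeled_truth)
    n_feat = sum(mask)
    # External agreement and feature count are maximized, so negate them.
    return (comp, -ext, -n_feat)


def dominates(a, b):
    """Pareto dominance for minimization: a is no worse everywhere, better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


# Toy data: the third feature is noise; class labels are known for two points only.
points = [(0, 0, 5), (0, 1, -5), (10, 0, 5), (10, 1, -5)]
centers = [(0, 0.5, 0), (10, 0.5, 0)]
two_features = objectives(points, [1, 1, 0], centers, [0, 2], [0, 1])
all_features = objectives(points, [1, 1, 1], centers, [0, 2], [0, 1])
print(two_features, all_features)
print(dominates(two_features, all_features), dominates(all_features, two_features))
```

In this toy case neither solution dominates the other: dropping the noisy feature improves compactness, while keeping it scores better on the feature count objective. This is the kind of trade-off that leaves several solutions on the Pareto-optimal front and makes a final selection step necessary.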
