A unifying criterion for unsupervised clustering and feature selection

Exploratory data analysis methods are essential for gaining insight into data. Two problems of interest in this context are identifying the most important variables and detecting quasi-homogeneous groups of data. Both are difficult, mainly because the underlying learning process is unsupervised. Unsupervised feature selection and unsupervised clustering can be approached successfully as optimization problems and solved with global optimization heuristics, provided an appropriate objective function is used. This paper introduces an objective function capable of efficiently guiding the search for significant features and, simultaneously, for the corresponding optimal partitions. Experiments on complex synthetic data suggest that the proposed function is unbiased with respect to both the number of clusters and the number of features.
