Information preserving multi-objective feature selection for unsupervised learning

In this work we propose a novel, theoretically sound framework for evolutionary feature selection in unsupervised machine learning problems. We show that unsupervised feature selection is inherently multi-objective and that it behaves differently from supervised feature selection: the number of features must be maximized rather than minimized. Although this may sound surprising from a supervised learning point of view, we illustrate this relationship on the problem of data clustering and show that existing approaches do not pose the optimization problem appropriately. Another important consequence of this paradigm change is a method for segmenting the Pareto sets produced by our approach. Inspecting only prototypical points from these segments drastically reduces the effort required to select a final solution. We compare our methods against existing approaches on eight data sets.
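The core idea above can be illustrated with a small sketch. Assume each candidate feature subset has been scored on two objectives: the number of selected features (to be maximized, per the paper's argument) and some clustering-quality cost such as a cluster-validity index (to be minimized). The function names, the toy objective values, and the choice of a two-objective formulation are illustrative assumptions, not the authors' actual implementation; the sketch only shows how Pareto dominance filters such candidates into a trade-off front:

```python
def dominates(a, b):
    # Each point is (n_features, cluster_cost).
    # a dominates b if a has at least as many features and no higher cost,
    # and is strictly better in at least one of the two objectives.
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(points):
    # Keep every point that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (n_features, cluster_cost) scores for six candidate subsets.
candidates = [(2, 0.4), (3, 0.5), (4, 0.6), (4, 0.8), (6, 0.9), (6, 1.2)]
print(pareto_front(candidates))
# → [(2, 0.4), (3, 0.5), (4, 0.6), (6, 0.9)]
```

The surviving points form the trade-off curve between feature-set size and clustering quality; in the paper's framework, such a front would then be segmented so that only prototypical points per segment need inspection.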
