An Efficient Feature Selection Approach for Clustering: Using a Gaussian Mixture Model of Data Dissimilarity

Rapid advances in computer and database technologies have enabled organizations to accumulate vast amounts of data. These large datasets make data analysis considerably more complicated. Feature selection is an effective dimensionality reduction technique that removes irrelevant, redundant, or noisy features. This research proposes a novel feature-selection measure that evaluates the importance of each feature for the clustering process. The measure extracts useful information from the dissimilarity between pairs of data objects, since dissimilarity is the common principle for deciding whether two objects should be placed in the same cluster. The probability distribution of the dissimilarity variable is therefore modeled as a mixture of two Gaussian distributions, one for intra-cluster dissimilarities and one for inter-cluster dissimilarities. The means of the two Gaussian components are estimated with the EM algorithm, and the difference between the two means is regarded as a meaningful measure for selecting features that matter for clustering. The effectiveness of the proposed measure is demonstrated through a set of experiments.
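As a concrete illustration, the sketch below scores one feature at a time: it computes all pairwise dissimilarities along that feature, fits a two-component Gaussian mixture to them by EM, and returns the gap between the two component means. This is a minimal sketch, assuming an absolute-difference (Euclidean) dissimilarity per feature and using scikit-learn's GaussianMixture as the EM implementation; the function name feature_score, the per-feature scoring loop, and the synthetic example data are illustrative assumptions, not details given in the paper.

```python
# Minimal sketch of the dissimilarity-mixture feature score.
# Assumptions (not from the paper): per-feature absolute-difference
# dissimilarity, scikit-learn's EM-based GaussianMixture, and the
# hypothetical helper name feature_score.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.mixture import GaussianMixture

def feature_score(x, random_state=0):
    """Score one feature: fit a 2-component Gaussian mixture to its
    pairwise dissimilarities and return the gap between the two means
    (interpreted as intra-cluster vs. inter-cluster dissimilarity)."""
    d = pdist(x.reshape(-1, 1))          # all pairwise |x_i - x_j| values
    gmm = GaussianMixture(n_components=2, random_state=random_state)
    gmm.fit(d.reshape(-1, 1))            # EM estimates the two component means
    m1, m2 = sorted(gmm.means_.ravel())  # smaller mean ~ intra-cluster
    return m2 - m1                       # larger gap => more useful feature

# Rank features of a data matrix X (rows = objects, columns = features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)),      # two well-separated groups
               rng.normal(4, 1, (50, 3))])     # in the first 3 features
X = np.hstack([X, rng.uniform(0, 1, (100, 2))])  # plus 2 noise features
scores = [feature_score(X[:, j]) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]       # informative features rank first
print(scores, ranking)
```

Features whose intra-cluster and inter-cluster dissimilarity means are well separated receive high scores, so sorting by the score yields a filter-style feature ranking that can precede any clustering algorithm.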
