A Gaussian Mixture Model to Detect Clusters Embedded in Feature Subspace

The goal of unsupervised learning, i.e., clustering, is to determine the intrinsic structure of unlabeled data. Feature selection for clustering improves grouping performance by removing irrelevant features. Typical feature selection algorithms select a common feature subset for all clusters; consequently, clusters embedded in different feature subspaces cannot be identified. In this paper, we introduce a probabilistic model based on Gaussian mixtures to solve this problem. In particular, the relevance of a feature to an individual cluster is treated as a probability, which is represented by a localized feature saliency and estimated by the Expectation Maximization (EM) algorithm during the clustering process. In addition, the number of clusters is determined simultaneously by integrating a Minimum Message Length (MML) criterion. Experiments carried out on both synthetic and real-world datasets illustrate the performance of the proposed approach in finding clusters embedded in feature subspaces.

1. Introduction

Clustering is the unsupervised classification of data objects into different groups (clusters) such that objects in one group are similar to one another and dissimilar from objects in other groups. Applications of data clustering are found in many fields, such as information discovery, text mining, web analysis, image grouping, medical diagnosis, and bioinformatics. Many clustering algorithms have been proposed in the literature (8). Basically, they can be categorized into two groups: hierarchical and partitional. A clustering algorithm typically considers all available features of the dataset in an attempt to learn as much as possible from the data. In practice, however, some features can be irrelevant and thus hinder clustering performance. Feature selection, which chooses the "best" feature subset for clustering, can be applied to solve this problem.
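To make the idea of a per-cluster feature relevance probability concrete, the following is a minimal Python sketch of how a mixture likelihood with localized feature saliencies might be evaluated. It is not the formulation of this paper, which the text above does not spell out. It assumes diagonal-covariance Gaussian components, a single common Gaussian density for irrelevant features, and a saliency `rho[j, l]` giving the probability that feature `l` is relevant to cluster `j`. All names (`saliency_mixture_loglik`, `rho`, `bg_mean`, `bg_var`) are hypothetical.

```python
import numpy as np


def gaussian_logpdf(y, mean, var):
    """Elementwise log-density of a 1-D Gaussian N(y | mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)


def saliency_mixture_loglik(Y, alpha, rho, mean, var, bg_mean, bg_var):
    """Log-likelihood under an ASSUMED localized-saliency mixture.

    Y        : (n, d) data matrix
    alpha    : (k,)   mixing weights, summing to 1
    rho      : (k, d) assumed saliency of feature l for cluster j, in [0, 1]
    mean,var : (k, d) cluster-specific Gaussian parameters
    bg_mean, bg_var : (d,) common density used when a feature is irrelevant
    """
    Yb = Y[:, None, :]  # shape (n, 1, d), broadcasts against (k, d)
    # log(0) = -inf is intended when a saliency is exactly 0 or 1.
    with np.errstate(divide="ignore"):
        log_rel = np.log(rho) + gaussian_logpdf(Yb, mean, var)
        log_irr = np.log1p(-rho) + gaussian_logpdf(Yb, bg_mean, bg_var)
    # Each feature is a two-part mixture: relevant vs. irrelevant.
    per_feature = np.logaddexp(log_rel, log_irr)          # (n, k, d)
    log_joint = np.log(alpha) + per_feature.sum(axis=2)   # (n, k)
    # Log-sum-exp over clusters, then sum over samples.
    return np.logaddexp.reduce(log_joint, axis=1).sum()


# Illustrative data: two clusters, three features, made-up parameters.
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))
alpha = np.array([0.4, 0.6])
mean = rng.normal(size=(2, 3))
var = np.ones((2, 3))
bg_mean, bg_var = np.zeros(3), np.ones(3)
rho = np.array([[0.9, 0.9, 0.1],   # cluster 0 lives mostly in features 0, 1
                [0.1, 0.9, 0.9]])  # cluster 1 lives mostly in features 1, 2
loglik = saliency_mixture_loglik(Y, alpha, rho, mean, var, bg_mean, bg_var)
```

In this sketch, setting every saliency to 1 recovers an ordinary diagonal-covariance Gaussian mixture, and setting every saliency to 0 makes the likelihood independent of the cluster parameters. An EM procedure would re-estimate `rho` together with the other parameters; that step is omitted here.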
Feature selection has been studied extensively in the supervised learning scenario (1-3), where class labels are available for judging the performance improvement contributed by a feature selection algorithm. For unsupervised learning, feature selection is a very difficult problem due to the lack of class labels, and it has received extensive attention recently. The algorithm proposed in (4) measures feature similarity by an information compression index. In (5), the relevant features are detected using a distance-based entropy measure. The method in (6) evaluates the cluster quality over different feature subsets by normalizing cluster separability or likelihood using a cross-projection method. In (7), feature saliency is defined as a probability and estimated by the Expectation Maximization (EM) algorithm using Gaussian mixture models. A variational Bayesian ap-

[1] Pat Langley, et al. Selection of Relevant Features and Examples in Machine Learning, 1997, Artif. Intell.

[2] Ravi Kothari, et al. Feature subset selection using a new definition of classifiability, 2003, Pattern Recognit. Lett.

[3] Qionghai Dai, et al. Similarity-based online feature selection in content-based image retrieval, 2006, IEEE Transactions on Image Processing.

[4] Farshad Fotouhi, et al. Co-clustering Documents and Words Using Bipartite Isoperimetric Graph Partitioning, 2006, Sixth International Conference on Data Mining (ICDM'06).

[5] Anil K. Jain, et al. Data clustering: a review, 1999, CSUR.

[6] Yi Zhang, et al. Entropy-based subspace clustering for mining numerical data, 1999, KDD '99.

[7] Huan Liu, et al. Feature selection for clustering - a filter solution, 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.

[8] Catherine Blake, et al. UCI Repository of machine learning databases, 1998.

[9] Filippo Menczer, et al. Feature selection in unsupervised learning via evolutionary search, 2000, KDD '00.

[10] Philip S. Yu, et al. Fast algorithms for projected clustering, 1999, SIGMOD '99.

[11] Chris H. Q. Ding, et al. Bipartite graph partitioning and data clustering, 2001, CIKM '01.

[12] Huan Liu, et al. Subspace clustering for high dimensional data: a review, 2004, SKDD.

[13] Carla E. Brodley, et al. Feature Selection for Unsupervised Learning, 2004, J. Mach. Learn. Res.

[14] Jing Hua, et al. Localized feature selection for clustering, 2008, Pattern Recognit. Lett.

[15] Anil K. Jain, et al. Feature Selection: Evaluation, Application, and Small Sample Performance, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[16] Aristidis Likas, et al. Bayesian feature and model selection for Gaussian mixture models, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17] Dimitrios Gunopulos, et al. Automatic subspace clustering of high dimensional data for data mining applications, 1998, SIGMOD '98.

[18] Anil K. Jain, et al. Simultaneous feature selection and clustering using mixture models, 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19] C. A. Murthy, et al. Unsupervised Feature Selection Using Feature Similarity, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[20] Hans-Peter Kriegel, et al. Subspace selection for clustering high-dimensional data, 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).