Model-based subspace clustering of non-Gaussian data

This paper presents a new generalized Dirichlet (GD) mixture model to address the challenging problem of clustering multidimensional data sets on different feature subsets. We approximate the class-conditional distributions of the mixture components in order to define a binary relevance indicator for each feature at the level of individual clusters. We consider a feature relevant to a cluster if it provides the information needed to assign data points to that cluster. We then define a new message length objective to learn the model and to select both the feature subsets and the number of components. The proposed method is more general than existing feature selection and subspace clustering models. In addition, it selects for each cluster only the relevant and statistically independent features, in time linear in the number of observations and dimensions. Experiments on synthetic data and on unsupervised image categorization show the merits of our approach.
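The idea of cluster-level binary feature relevance can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes hard cluster labels are already known, models each feature with a Beta density fitted by the method of moments, and replaces the message length objective with a fixed log-likelihood-gain threshold. The names `fit_beta_moments`, `beta_loglik`, `relevance_mask` and the `threshold` parameter are hypothetical.

```python
import math
import random


def fit_beta_moments(xs):
    """Method-of-moments Beta(a, b) fit for values in (0, 1)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    # Clamp the variance so that both shape parameters stay positive.
    var = min(max(var, 1e-6), mean * (1 - mean) * 0.999)
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def beta_loglik(xs, a, b):
    """Log-likelihood of xs under Beta(a, b)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return sum(
        log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) for x in xs
    )


def relevance_mask(data, labels, n_clusters, threshold=10.0):
    """mask[j][l] is True when feature l is judged relevant to cluster j.

    A feature counts as relevant when a cluster-specific density explains the
    cluster's points better than a background density shared by all points,
    by more than `threshold` (an assumed stand-in for a message-length test).
    """
    n_features = len(data[0])
    # One background density per feature, fitted on all observations.
    background = [
        fit_beta_moments([row[l] for row in data]) for l in range(n_features)
    ]
    mask = []
    for j in range(n_clusters):
        members = [row for row, lab in zip(data, labels) if lab == j]
        row_mask = []
        for l in range(n_features):
            col = [row[l] for row in members]
            gain = beta_loglik(col, *fit_beta_moments(col)) - beta_loglik(
                col, *background[l]
            )
            row_mask.append(gain > threshold)
        mask.append(row_mask)
    return mask


def draw(rng, a, b):
    """Beta draw clamped away from 0 and 1 so the logarithms stay finite."""
    return min(max(rng.betavariate(a, b), 1e-9), 1 - 1e-9)


# Synthetic data: feature 0 separates the two clusters, feature 1 does not.
rng = random.Random(0)
data, labels = [], []
for j, (a, b) in enumerate([(2, 8), (8, 2)]):
    for _ in range(200):
        data.append([draw(rng, a, b), draw(rng, 2, 2)])
        labels.append(j)

mask = relevance_mask(data, labels, n_clusters=2)
print(mask)
```

Each feature is scored on its own in a single pass over the observations of each cluster, so the cost of this sketch grows linearly with the number of observations and dimensions. Scoring features separately assumes they are independent within a cluster.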
