High-dimensional data clustering

Clustering in high-dimensional spaces is a difficult problem that recurs in many domains, such as image analysis. The difficulty arises because high-dimensional data usually live in different low-dimensional subspaces hidden in the original space. This paper presents a family of Gaussian mixture models designed for high-dimensional data that combine the ideas of dimension reduction and parsimonious modeling. These models give rise to a clustering method based on the Expectation-Maximization algorithm, called High-Dimensional Data Clustering (HDDC). To fit the data correctly, HDDC estimates the specific subspace and the intrinsic dimension of each group. Our experiments on artificial and real datasets show that HDDC outperforms existing methods for clustering high-dimensional data.
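
As a rough illustration of what estimating "the specific subspace and the intrinsic dimension of each group" can involve, the sketch below eigen-decomposes the sample covariance of a single group and keeps the leading eigenvectors up to the last large drop in the eigenvalue spectrum. This is not the HDDC estimator itself: the function name `estimate_group_subspace`, the scree-style drop rule, and the `threshold` parameter are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def estimate_group_subspace(X, threshold=0.2):
    """Illustrative (hypothetical) per-group subspace estimate.

    X is an (n_samples, n_features) array holding the points of one group.
    Returns the group mean, an orthonormal basis of the retained subspace,
    and the selected dimension d.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    # eigh returns eigenvalues in ascending order; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumed scree-style rule: keep every direction up to the last drop
    # between successive eigenvalues that is at least `threshold` times
    # the largest drop.
    drops = -np.diff(eigvals)
    significant = np.nonzero(drops >= threshold * drops.max())[0]
    d = int(significant[-1]) + 1
    return mean, eigvecs[:, :d], d

# Synthetic group: 200 points in 10 dimensions lying close to a 2-D subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([5.0, 3.0])
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]  # orthonormal 10x2 basis
X = latent @ basis.T + 0.1 * rng.normal(size=(200, 10))

mean, Q, d = estimate_group_subspace(X)
print("estimated intrinsic dimension:", d)
```

In a mixture setting, a step of this kind would be applied to each group separately (weighted by posterior membership inside an EM loop), so that every group can have its own subspace and its own dimension.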
