What are Clusters in High Dimensions and are they Difficult to Find?

The distribution of distances between points in a high-dimensional data set tends to look quite different from the distribution of distances in a low-dimensional data set. Concentration of norm is one of the phenomena that can affect high-dimensional data sets. It means that in high dimensions, under certain general assumptions, the distances from any point to its closest and farthest neighbour tend to be almost identical in relative terms. Since cluster analysis is usually based on distances, the influence of such effects on clustering results must be taken into account. This paper investigates the consequences that the special properties of high-dimensional data have for cluster analysis. We discuss questions such as when clustering in high dimensions is meaningful at all, whether the clusters can be mere artifacts, and what algorithmic problems clustering methods face in high dimensions.
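
The concentration effect described above can be illustrated with a small numerical experiment. The following is a minimal sketch, not the method used in the paper: it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function name relative_contrast is a hypothetical helper introduced only for this illustration. It measures how far the farthest neighbour of a query point is compared with its nearest neighbour, expressed relative to the nearest-neighbour distance.

```python
import numpy as np


def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption for illustration only: data and query are sampled
    uniformly from the unit hypercube [0, 1]^dim, and distances are
    Euclidean.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))   # n_points samples in [0, 1]^dim
    query = rng.random(dim)              # a single query point
    dists = np.linalg.norm(data - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


if __name__ == "__main__":
    # Print the relative contrast for increasing dimensionality.
    for dim in (2, 10, 100, 1000):
        print(dim, relative_contrast(1000, dim))
```

Under these assumptions the printed contrast should shrink as the dimension grows, which is the sense in which the nearest and farthest neighbours become almost indistinguishable. Other data distributions or distance measures may behave differently, so the sketch only illustrates the phenomenon and says nothing about a particular data set.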
