Robust clustering in high dimensional data using statistical depths

Background: Mean-based clustering algorithms such as bisecting k-means generally lack robustness. Although componentwise median is a more robust alternative, it can be a poor center representative for high dimensional data. We need a new algorithm that is robust and works well in high dimensional data sets, e.g., gene expression data.

Results: Here we propose a new robust divisive clustering algorithm, the bisecting k-spatialMedian, based on the statistical spatial depth. A new subcluster selection rule, Relative Average Depth, is also introduced. We demonstrate that the proposed clustering algorithm outperforms the componentwise-median-based bisecting k-median algorithm for high dimension and low sample size (HDLSS) data via applications of the algorithms on two real HDLSS gene expression data sets. When further applied on noisy real data sets, the proposed algorithm compares favorably in terms of robustness with the componentwise-median-based bisecting k-median algorithm.

Conclusion: Statistical data depths provide an alternative way to find the "center" of multivariate data sets and are useful and robust for clustering.
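To make the notion of a depth-based "center" concrete, the following is a minimal Python sketch of the sample spatial depth and the spatial median (the point of maximal spatial depth). It is an illustration under stated assumptions, not the authors' implementation: the function names `spatial_depth` and `spatial_median` are hypothetical, the depth is taken in its usual form (one minus the norm of the average unit vector pointing from the data to the query point), and the median is computed with Weiszfeld's iteration, a standard choice that the abstract does not specify. The Relative Average Depth rule is not defined in the abstract and is therefore not sketched here.

```python
import numpy as np


def spatial_depth(x, data, eps=1e-12):
    """Sample spatial depth of point x with respect to the rows of data.

    Assumed form: 1 - || mean of unit vectors (x - x_i) / ||x - x_i|| ||.
    Data points that coincide with x contribute a zero vector.
    """
    diffs = x - data                          # shape (n, d)
    norms = np.linalg.norm(diffs, axis=1)     # shape (n,)
    keep = norms > eps                        # skip points equal to x
    units = diffs[keep] / norms[keep, None]
    return 1.0 - np.linalg.norm(units.sum(axis=0) / len(data))


def spatial_median(data, n_iter=200, tol=1e-9, eps=1e-12):
    """Approximate the spatial (L1) median with Weiszfeld's iteration.

    Weiszfeld's method is an assumption made for this sketch; the source
    does not say how the spatial median is computed.
    """
    center = data.mean(axis=0)                # start from the ordinary mean
    for _ in range(n_iter):
        norms = np.linalg.norm(data - center, axis=1)
        weights = 1.0 / np.maximum(norms, eps)    # guard against division by zero
        new_center = (weights[:, None] * data).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_center - center) < tol:
            return new_center
        center = new_center
    return center


if __name__ == "__main__":
    # Synthetic HDLSS-style data: 30 samples in 50 dimensions plus 3 outliers.
    rng = np.random.default_rng(0)
    bulk = rng.normal(size=(30, 50))
    outliers = rng.normal(loc=20.0, size=(3, 50))
    data = np.vstack([bulk, outliers])

    center = spatial_median(data)
    print("depth at spatial median:", spatial_depth(center, data))
    print("depth at an outlier:    ", spatial_depth(outliers[0], data))
```

In a divisive scheme of the kind the abstract describes, a center such as `spatial_median(cluster)` would play the role that the mean plays in bisecting k-means; how the authors assign points and choose the next subcluster to split is not recoverable from the abstract alone.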
