Pooled variable scaling for cluster analysis

MOTIVATION Many popular clustering methods are not scale-invariant because they are based on Euclidean distances. Even methods that use scale-invariant distances, such as the Mahalanobis distance, lose their scale invariance when combined with regularization and/or variable selection. The results of these methods are therefore very sensitive to the measurement units of the clustering variables. A simple way to achieve scale invariance is to scale the variables before clustering. However, scaling is a delicate issue in cluster analysis: a bad choice of scaling can adversely affect the clustering results. On the other hand, reporting clustering results that depend on measurement units is unsatisfactory. Hence, a safe and efficient scaling procedure is needed for applications in bioinformatics and medical sciences research.

RESULTS We propose a new approach to scaling prior to cluster analysis, based on the concept of pooled variance. Unlike existing scaling procedures, such as those based on the standard deviation or the range, the proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer tissue obtained from human patients.

AVAILABILITY An R implementation of the algorithms presented is available at https://wis.kuleuven.be/statdatascience/robust/software.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
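To make the pooled-variance idea in the RESULTS statement above concrete, the following is a minimal Python sketch, not the authors' algorithm (their implementation is in R at the URL given under AVAILABILITY). It rests on assumptions the abstract does not state: each variable is split into exactly two groups, the split is found by exhaustive search over cut points of the sorted values, and the pooled variance uses n - k degrees of freedom. The function names `pooled_sd_two_groups` and `overall_sd` are hypothetical.

```python
import math
import random


def overall_sd(values):
    """Ordinary sample standard deviation of one variable."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def pooled_sd_two_groups(values):
    """Pooled within-group SD after the best two-group split of one variable.

    Assumption (not from the abstract): the number of groups is fixed at 2,
    and the split is the cut of the sorted values that minimises the total
    within-group sum of squares.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 observations")

    def sum_sq(block):
        # Sum of squared deviations from the block mean.
        mean = sum(block) / len(block)
        return sum((x - mean) ** 2 for x in block)

    # Try every cut point; each side must hold at least one observation.
    best_wss = min(sum_sq(xs[:cut]) + sum_sq(xs[cut:]) for cut in range(1, n))
    # Pooled variance: within-group sum of squares over n - k, with k = 2.
    return math.sqrt(best_wss / (n - 2))


# Simulated informative variable: two groups of 50 whose means differ by 10.
rng = random.Random(0)
informative = [rng.gauss(0, 1) for _ in range(50)] + [
    rng.gauss(10, 1) for _ in range(50)
]

# Separation between the group means after dividing by each scale.
gap_after_sd_scaling = 10 / overall_sd(informative)
gap_after_pooled_scaling = 10 / pooled_sd_two_groups(informative)

print("group separation after SD scaling:    ", gap_after_sd_scaling)
print("group separation after pooled scaling:", gap_after_pooled_scaling)
```

In this sketch the overall standard deviation is inflated by the distance between the two groups, so dividing by it shrinks the separation that makes the variable informative, whereas the pooled scale reflects only the within-group spread. Forcing two groups is a simplification: on a variable with no group structure, a fixed two-group split would understate the spread, so a practical procedure needs a data-driven choice of the number of groups.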
