Sparse principal components by semi-partition clustering

A cluster-based method for constructing sparse principal components is proposed. The method first forms clusters of variables using a new clustering approach, called the semi-partition, which works in two steps. First, the variables are ordered sequentially according to a criterion involving the correlations between variables. Then, the ordered variables are split into two parts based on their generalized variance. The first part becomes an output cluster, while the second part serves as input for another run of the sequential process. After the optimal clusters have been formed, sparse components are constructed from the singular value decomposition of the data matrix of each cluster. The method is applied to simple data sets with fewer variables (p) than observations (n), as well as to large gene expression data sets with p ≫ n. The resulting cluster-based sparse principal components are promising when evaluated by objective criteria. The method is also compared with existing approaches and is found to perform well.
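The abstract does not state the ordering criterion or the splitting rule, so the following is only a minimal sketch of the overall loop under stated assumptions, not the authors' algorithm. Assumed here: variables are ordered greedily by mean absolute correlation with those already ordered, and the split is placed at the first variable whose addition multiplies the generalized variance (the determinant of the correlation matrix) by more than a hypothetical `threshold`. The function names, `threshold`, and `ridge` are all illustrative.

```python
import numpy as np


def order_variables(R):
    """Greedy ordering (assumed criterion): start from the most correlated
    pair, then append the variable with the highest mean |correlation|
    to the variables already ordered."""
    p = R.shape[0]
    if p == 1:
        return [0]
    A = np.abs(R) - np.eye(p)  # zero the diagonal so a variable never pairs with itself
    i, j = np.unravel_index(np.argmax(A), A.shape)
    order = [int(i), int(j)]
    remaining = set(range(p)) - set(order)
    while remaining:
        rem = sorted(remaining)
        scores = [A[k, order].mean() for k in rem]
        nxt = rem[int(np.argmax(scores))]
        order.append(nxt)
        remaining.remove(nxt)
    return order


def split_point(R, order, threshold=0.5, ridge=1e-8):
    """Assumed splitting rule: det(R_{m+1}) / det(R_m) equals 1 - R^2 of the
    new variable regressed on the first m; cut where that ratio exceeds
    `threshold`, i.e. where the new variable adds mostly new variance."""
    for m in range(1, len(order)):
        head, new = order[:m], order[m]
        Rm = R[np.ix_(head, head)] + ridge * np.eye(m)  # ridge guards against singular blocks
        r = R[head, new]
        ratio = 1.0 - r @ np.linalg.solve(Rm, r)
        if ratio > threshold:
            return m
    return len(order)


def semi_partition(X, threshold=0.5):
    """Repeat order-then-split: the first part is output as a cluster,
    the second part is fed back into the next run."""
    R = np.corrcoef(X, rowvar=False)
    remaining = list(range(X.shape[1]))
    clusters = []
    while remaining:
        sub = R[np.ix_(remaining, remaining)]
        order = order_variables(sub)
        m = split_point(sub, order, threshold)
        clusters.append(sorted(remaining[i] for i in order[:m]))
        remaining = [remaining[i] for i in order[m:]]
    return clusters


def cluster_components(X, clusters):
    """One sparse loading vector per cluster: the leading right singular
    vector of that cluster's centered data matrix, zero elsewhere."""
    Xc = X - X.mean(axis=0)
    V = np.zeros((X.shape[1], len(clusters)))
    for k, idx in enumerate(clusters):
        _, _, vt = np.linalg.svd(Xc[:, idx], full_matrices=False)
        V[idx, k] = vt[0]
    return V
```

Calling `semi_partition(X)` on an n × p data matrix returns a list of variable-index clusters, and `cluster_components(X, clusters)` returns a p × k loading matrix in which each column is nonzero only on its own cluster. Because the per-cluster SVD works on an n × (cluster size) matrix, the last step remains usable when p ≫ n.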
