Principal component analysis for clustering gene expression data

MOTIVATION: There is a great need to develop analytical methodology to analyze and exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for the analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to gene expression data. Using different data analysis techniques and different clustering algorithms on the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. Specifically, using both real and synthetic gene expression data sets, we compared the quality of clusters obtained from the original data to the quality of clusters obtained after projecting onto subsets of the principal component axes.

RESULTS: Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has a different impact on different algorithms and different similarity metrics. Overall, we would not recommend PCA before clustering except in special circumstances.
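The comparison described above can be illustrated with a minimal sketch. It is not the authors' code or data. It assumes a toy data set of 123 rows and 3 columns in which three known groups are separated in two low-variance columns while a third, higher-variance column carries no group information. It also assumes plain k-means with Euclidean distance and the adjusted Rand index as the measure of agreement with the known groups. All function names (`adjusted_rand_index`, `kmeans`, `sq_dists`) and parameter values are hypothetical choices for this illustration.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)              # contingency table of the two partitions
    pairs = lambda m: m * (m - 1) / 2.0      # "m choose 2", elementwise
    cells = pairs(table).sum()
    rows = pairs(table.sum(axis=1)).sum()
    cols = pairs(table.sum(axis=0)).sum()
    expected = rows * cols / pairs(len(a))
    maximum = 0.5 * (rows + cols)
    return 1.0 if maximum == expected else (cells - expected) / (maximum - expected)

def sq_dists(X, centers):
    """Squared Euclidean distance from every row of X to every center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, n_restarts=200, n_iter=100, seed=0):
    """Plain Lloyd k-means with random restarts; returns the lowest-inertia labels."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            labels = sq_dists(X, centers).argmin(axis=1)
            # Keep the old center if a cluster ends up empty.
            updated = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
            if np.allclose(updated, centers):
                break
            centers = updated
        d2 = sq_dists(X, centers)
        inertia = d2.min(axis=1).sum()
        if inertia < best_inertia:
            best_inertia, best_labels = inertia, d2.argmin(axis=1)
    return best_labels

# Toy data (an assumption for illustration, not real expression data):
# column 0 has the largest variance but is identical across groups;
# columns 1 and 2 place the three groups on the corners of a triangle.
x = np.linspace(-3.0, 3.0, 41)
radius = np.sqrt(5.5)
angles = 2.0 * np.pi * np.arange(3) / 3.0
data = np.vstack([
    np.column_stack([x, np.full_like(x, radius * np.cos(t)),
                     np.full_like(x, radius * np.sin(t))])
    for t in angles
])
truth = np.repeat(np.arange(3), len(x))

# PCA via SVD of the centered data; rows of `components` are the PC axes.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
explained = singular_values ** 2 / (len(data) - 1)   # variance along each PC
scores = centered @ components.T

# Cluster the original data, the first PC only, and the last two PCs.
ari_full = adjusted_rand_index(truth, kmeans(data, 3))
ari_pc1 = adjusted_rand_index(truth, kmeans(scores[:, :1], 3))
ari_last_two = adjusted_rand_index(truth, kmeans(scores[:, 1:], 3))

print("variance per PC:", np.round(explained, 3))
print("ARI, original data:", round(ari_full, 3))
print("ARI, first PC only:", round(ari_pc1, 3))
print("ARI, last two PCs:", round(ari_last_two, 3))
```

In this constructed case the first PC aligns with the uninformative column, so clustering on it alone should agree with the known groups no better than chance, while the original data and the two lower-variance PCs retain the group structure. The example is deliberately extreme and says nothing about how often this happens in real expression data.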
