STATISTICAL ISSUES IN THE CLUSTERING OF GENE EXPRESSION DATA

This paper illustrates some of the problems that can occur in any data set when clustering samples of gene expression profiles. These include a possible high degree of dependence of the results on the choice of clustering algorithm, further dependence on the choice of genes and samples included in the clustering (for example, whether or not to include control samples), and difficulty in assessing the validity of the grouping. We also demonstrate the use of Cox regression as a tool to identify genes influencing survival.
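
The dependence on the choice of algorithm can be seen even on a tiny example. The sketch below is not the paper's analysis: it assumes seven made-up one-dimensional values (standing in for samples, not real expression data) and a hypothetical helper `agglomerate` that performs bottom-up hierarchical clustering with either single linkage (`min`) or complete linkage (`max`). Cut at two groups, the two linkage rules partition the same values differently.

```python
from itertools import combinations


def agglomerate(points, k, linkage):
    """Merge clusters bottom-up until k remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy values chosen for illustration only (an assumption, not data from the paper).
profile = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0, 9.0]

single = agglomerate(profile, 2, min)
complete = agglomerate(profile, 2, max)

print("single linkage:  ", single)
print("complete linkage:", complete)
```

Single linkage chains the closely spaced values together and isolates the outlying one, while complete linkage prefers two compact groups, so the grouping reported depends on which rule was chosen.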
