Array technologies have made it straightforward to monitor simultaneously the expression patterns of thousands of genes. The challenge now is to interpret such massive data sets. The first step is to extract the fundamental patterns of gene expression inherent in the data. This paper describes the application of self-organizing maps, a type of mathematical cluster analysis that is particularly well suited for recognizing and classifying features in complex, multidimensional data. The method has been implemented in a publicly available computer package, GENECLUSTER, that performs the analytical calculations and provides easy data visualization. To illustrate the value of such analysis, the approach is applied to hematopoietic differentiation in four well-studied models (HL-60, U937, Jurkat, and NB4 cells). Expression patterns of some 6,000 human genes were assayed, and an online database was created. GENECLUSTER was used to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation, for example by highlighting certain genes and pathways involved in the “differentiation therapy” used in the treatment of acute promyelocytic leukemia.

Array technologies have made it straightforward to monitor simultaneously the expression patterns of thousands of genes during cellular differentiation and response (1–5). The challenge now is to make sense of such massive data sets. For simple experiments comparing just two samples, it is enough to rank the genes by their relative induction. Richer experimental designs, however, could involve hundreds of samples, for example complete developmental time courses in many cell lines. No two genes are likely to exhibit precisely the same response, and many distinct types of behavior may be present. A key goal is to extract the fundamental patterns of gene expression inherent in the data.
Many mathematical techniques have been developed for identifying underlying patterns in complex data for such diverse applications as object recognition by machine vision systems, phoneme detection in speech processing, bandwidth compression in telecommunications, and signal classification in electrocardiography and sleep research (6–10). The techniques are essentially different ways to cluster points in multidimensional space. They can be directly applied to gene expression by regarding the quantitative expression levels of n genes in k samples as defining n points in k-dimensional space.

Clustering Techniques. The question is, which clustering techniques are likely to be most useful for interpreting gene expression? One simple approach is to use direct visual inspection to group together genes with similar expression patterns. This approach was recently used by Cho et al. (4) to cluster genes whose expression correlated with particular phases of the cell cycle. The method is best suited for instances in which the patterns of interest are clear in advance (such as a periodic fluctuation in phase with the cell cycle), but it does not scale well to larger data sets and is less appropriate for discovering unexpected patterns.

A common computational approach is hierarchical clustering (6–8). Data points are forced into a strict hierarchy of nested subsets: the closest pair of points is grouped and replaced by a single point representing their set average, the next closest pair of points is treated similarly, and so on. The data points are thus fashioned into a phylogenetic tree whose branch lengths represent the degree of similarity between the sets. Hierarchical clustering has recently been described for gene expression and has clearly proven valuable (11–13). Hierarchical clustering, however, has a number of shortcomings for the study of gene expression.
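As a concrete illustration of the agglomerative procedure just described, the following is a minimal Python sketch, not the implementation used in the cited studies. Each gene is treated as a point in k-dimensional space, the closest pair of clusters is merged, and the pair is replaced by its set average. The Euclidean distance, the function names, and the four toy expression profiles are assumptions made purely for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points):
    """Merge the closest pair of clusters until a single cluster remains.

    points maps a gene name to its k expression values.
    Returns the merges in order as (members_a, members_b, distance).
    """
    # Each cluster is (tuple of member names, set average of the members).
    clusters = [((name,), list(vec)) for name, vec in points.items()]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (names_i, c_i), (names_j, c_j) = clusters[i], clusters[j]
        n_i, n_j = len(names_i), len(names_j)
        # Replace the pair by the average over all member points.
        average = [(n_i * x + n_j * y) / (n_i + n_j) for x, y in zip(c_i, c_j)]
        merges.append((names_i, names_j, d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((names_i + names_j, average))
    return merges


# Hypothetical expression levels of n = 4 genes in k = 3 samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [3.0, 2.0, 1.0],
    "geneD": [2.9, 1.9, 1.2],
}

merges = agglomerate(profiles)
for members_a, members_b, distance in merges:
    print(members_a, "+", members_b, "at distance", round(distance, 3))
```

In this toy data the two rising profiles (geneA, geneB) are joined first and the two falling profiles (geneC, geneD) second. Each merge is final once made, which is the local, irreversible decision-making that the critique of hierarchical clustering refers to.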
Strict phylogenetic trees are best suited to situations of true hierarchical descent (such as in the evolution of species) and are not designed to reflect the multiple distinct ways in which expression patterns can be similar; this problem is exacerbated as the size and complexity of the data set grow. Hierarchical clustering has been noted by statisticians to suffer from lack of robustness, nonuniqueness, and inversion problems that complicate interpretation of the hierarchy (see ref. 14 for a detailed study). Finally, the deterministic nature of hierarchical clustering can cause points to be grouped based on local decisions, with no opportunity to reevaluate the clustering. It is known that the resulting trees can lock in accidental features, reflecting idiosyncrasies of the agglomeration rule.

Various other clustering techniques are used in biological applications but have not yet been applied to the analysis of gene expression. These techniques include Bayesian clustering, k-means clustering, and self-organizing maps (SOMs). Bayesian clustering is a highly structured approach appropriate when a strong prior distribution on the data is available. k-means clustering is a completely unstructured approach, which proceeds in an entirely local fashion and produces an unorganized collection of clusters that is not conducive to interpretation.

SOMs (9, 10) have a number of features that make them particularly well suited to clustering and analysis of gene expression patterns. They are ideally suited to exploratory data analysis, allowing one to impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the nonstructure of k-means clustering) and facilitating easy visualization and interpretation. SOMs have good

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. §1734 solely to indicate this fact. PNAS is available online at www.pnas.org.

Abbreviations: SOM, self-organizing maps; ATRA, all-trans retinoic acid; PMA, phorbol 12-myristate 13-acetate.

¶To whom reprint requests should be addressed at: Whitehead/Massachusetts Institute of Technology Center for Genome Research, Building 300, 1 Kendall Square, Cambridge, MA 02139. e-mail: lander@genome.wi.mit.edu or golub@genome.wi.mit.edu.
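To make the notion of "partial structure" concrete, the following is a minimal Python sketch of a generic SOM training loop. It is not the GENECLUSTER implementation. The user chooses a grid of nodes in advance, each node carries a reference vector in the k-dimensional expression space, and every randomly drawn data point pulls the closest node and its grid neighbors toward itself. The learning-rate and neighborhood schedules, the 1 x 2 grid, the function names, and the toy profiles are all assumptions chosen for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(data, rows, cols, iterations=2000, seed=0):
    """Fit a rows x cols grid of reference vectors to the data points."""
    rng = random.Random(seed)
    # One reference vector per grid node, started at a randomly chosen point.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        remaining = 1.0 - t / iterations
        rate = 0.1 * remaining                # assumed learning-rate schedule
        radius = max(rows, cols) * remaining  # assumed shrinking neighborhood
        x = rng.choice(data)
        # The winner is the node whose reference vector is closest to x.
        winner = min(nodes, key=lambda g: sq_dist(nodes[g], x))
        for (r, c), w in nodes.items():
            # Nodes within the current radius of the winner on the grid
            # are moved toward x; all other nodes are left unchanged.
            if math.hypot(r - winner[0], c - winner[1]) <= radius:
                for i in range(len(w)):
                    w[i] += rate * (x[i] - w[i])
    return nodes


def assign(data, nodes):
    """Map each data point to the grid node with the closest reference vector."""
    return [min(nodes, key=lambda g: sq_dist(nodes[g], x)) for x in data]


# Hypothetical expression profiles: 6 genes measured in 3 samples.
profiles = [
    [1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, 2.0, 3.1],
    [3.0, 2.0, 1.0], [2.9, 1.9, 1.2], [3.1, 2.1, 0.9],
]

som = train_som(profiles, rows=1, cols=2)
labels = assign(profiles, som)
for profile, node in zip(profiles, labels):
    print(profile, "->", node)
```

The grid geometry is what supplies the partial structure. Early in training the neighborhood spans the whole grid, so neighboring nodes are updated together; as the radius shrinks, each node is refined mostly by the points it wins. Setting the radius to zero from the start removes the coupling between nodes and leaves an unstructured, k-means-like online update.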
[1] M. V. Velzen et al. Self-organizing maps. 2007.
[2] John A. Hartigan et al. Clustering Algorithms. 1975.
[3] Robert F. Ling et al. Applied Multivariate Data Analysis, Vol. I: Regression and Experimental Design (J. D. Jobson). SIAM Rev., 1992.
[4] A. D. Gordon et al. Classification: Methods for the Exploratory Analysis of Multivariate Data. 1981.
[5] Vincent Kanade et al. Clustering Algorithms. Wireless RF Energy Transfer in the Massive IoT Era, 2021.
[6] J. D. Jobson et al. Categorical and multivariate methods. 1992.
[7] R. Rosenfeld. Nature. Otolaryngology--head and neck surgery: official journal of American Academy of Otolaryngology-Head and Neck Surgery, 2009.