Points of Significance: Clustering

Many biological analyses involve partitioning samples or variables into clusters on the basis of similarity or its converse, distance. For example, in a gene expression study, we might seek subsets of patients with similar expression, or take the complementary approach and identify genes that are similarly expressed across patients. Clustering is a type of unsupervised learning comprising many different methods1. Here we will focus on two common methods: hierarchical clustering2, which can use any similarity measure, and k-means clustering3, which uses Euclidean or correlation distance.

Fundamentally, all clustering methods apply the same approach: we first calculate similarity and then use it to group objects (e.g., samples) into clusters. However, the clustering output is useful only if the clusters correspond to biologically relevant features of the data that were not used to define the grouping. Because the clusters are not known in advance, we need external information to judge their validity. For example, our confidence in the clusters increases if patients in each cluster share a phenotype, or if genes in each cluster share a sequence motif; but confidence increases only if this information was not used to assess similarity in the first place.

Let's look at how similarity can be calculated. Suppose we have expression profiles for five genes, A–E, across n = 15 patients, and we want to compare these profiles to a reference profile (Fig. 1). A visual assessment may be misleading. For example, the difference in expression relative to the reference is smaller for gene B than for gene A, which might make us think that the gene B profile is more similar to the reference than the gene A profile. However, gene B has a completely different pattern of expression (constant) from the reference, whereas gene A has the same pattern as the reference but with an offset. Although there are many ways to calculate the similarity of two such profiles, including subjective measures, we use the common geometric notion of Euclidean distance expressed as the root mean square (r.m.s.; Fig. 1a). This quantity includes a factor of 1/√n so that it does not depend on n alone; for example, two profiles that differ by only a constant offset have r.m.s. equal to that offset for any n. Similarity can be expressed as c – r.m.s. (where c is some constant, such as the maximum distance between objects, and negative values are set to zero), so that objects at distance c or greater have zero similarity.

Practically, similarity in expression should be based on regulation rather than absolute abundance. To emphasize regulation, we can center the expression values by subtracting the profile's mean from each of its expression values (Fig. 1b). To focus on the pattern rather than the magnitude of regulation, we can first convert profiles to z-scores, which give the variation from the mean in units of s.d. (Fig. 1c). The squared r.m.s. distance between z-score profiles is 2(1 – r), where r is the correlation of the profiles; perfectly correlated profiles have r.m.s. = 0. Distance may instead be defined as 1 – |r| to cluster genes with opposing regulation, such as gene D, which is perfectly negatively correlated with the reference (Fig. 1c). When using correlation distance, it is common to filter out profiles with very low variance, whose pattern may be due to chance.

Once we have the similarity between objects, we group them into clusters. In hierarchical clustering, the nodes start off as individual objects and are then iteratively merged on the basis of pairwise distance (Fig. 2a).
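To make these calculations concrete, here is a minimal NumPy sketch; the profiles, the seed, and the helper names rms_distance, center and zscore are hypothetical, introduced here for illustration rather than taken from the article.

```python
import numpy as np

def rms_distance(x, y):
    # Euclidean distance expressed as root mean square:
    # sqrt((1/n) * sum_i (x_i - y_i)^2), with the 1/sqrt(n) factor built in.
    return np.sqrt(np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2))

def center(x):
    # Subtract the profile mean to emphasize regulation over abundance.
    x = np.asarray(x, float)
    return x - x.mean()

def zscore(x):
    # Variation from the mean in units of s.d.; undefined for constant
    # profiles (zero variance), as for gene B in Fig. 1c.
    x = np.asarray(x, float)
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
ref = rng.normal(size=15)          # hypothetical reference profile, n = 15
gene_a = ref + 2.0                 # same pattern as the reference, offset by 2

print(rms_distance(ref, gene_a))                   # 2.0: the offset dominates
print(rms_distance(center(ref), center(gene_a)))   # 0.0: centered profiles overlap

# For z-score profiles, the squared r.m.s. distance equals 2 * (1 - r):
gene_c = ref + rng.normal(scale=0.5, size=15)      # noisy variant of the reference
r = np.corrcoef(ref, gene_c)[0, 1]
d = rms_distance(zscore(ref), zscore(gene_c))
print(np.isclose(d ** 2, 2 * (1 - r)))             # True
```

Centering removes a constant offset entirely, which is why gene A matches the reference exactly in Fig. 1b; the check in the last lines is the 2(1 – r) relation noted above.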
There are many ways of calculating this distance, but the most common methods are complete linkage clustering and single linkage clustering, which return the maximum or the minimum, respectively, of all pairwise distances between the objects in two nodes. The clustering is typically depicted by a dendrogram, in which the height of a branch is either the step at which the nodes were merged or the distance between them (Fig. 2b). Clusters are formed by partitioning the dendrogram, for example by cutting it at a fixed height and considering each of the resulting subtrees a cluster. Membership in clusters depends on both the cutoff and the similarity measure (Fig. 3). Alternatively, clusters can be formed with selective cuts at different heights, informed by the underlying biology. When comparing two dendrograms, take into account that the order of branches in a dendrogram is arbitrary: nodes that are near each other (e.g., profiles D5 and E5 in Fig. 3b) may lose their spatial adjacency with a single branch flip.

In contrast to hierarchical clustering, k-means clustering requires that we first choose the number of clusters, k. In Figure 4a we illustrate this process using k = 3 and a simulated two-dimensional data set with points randomly placed in three adjoining areas (gray regions). Minimal code sketches of both methods are given after the figure captions below.

Figure 1 | Similarity measures between expression profiles across n = 15 patients (dots) of five putative genes (blue) and a reference (gray). (a) Absolute expression profiles of genes A–E generated by various transformations from the reference. Their similarity to the reference is shown as the Euclidean distance expressed as root mean square (r.m.s.). Gene C is most similar to the reference (r.m.s. = 0.76), followed by gene B (r.m.s. = 1.52). (b) Profiles from (a) centered on their means and the corresponding r.m.s. The gene A and reference profiles now overlap (r.m.s. = 0), and the similarity of gene E to the reference has decreased to be the same as that of gene C (r.m.s. = 0.76). (c) Profiles from (a) transformed into z-scores. Gene B has no profile because the z-score is undefined when no variation is present.

Figure 2 | Complete linkage clustering of five objects. (a) Pairwise distances (step 1) are used to merge objects (steps 2–4), with the maximum of all pairwise distances taken as the distance between nodes. At each merging step, the shortest distance is chosen (blue). (b) A dendrogram with a vertical axis showing the distance between merged nodes. To create clusters, one can cut the tree at a fixed height (dashed line).
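To make the linkage and tree-cutting steps concrete, here is a minimal sketch using SciPy's hierarchical clustering routines; the simulated profiles and the cut height t = 5.0 are arbitrary illustrative choices, not values from the figures.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
profiles = rng.normal(size=(5, 15))   # five hypothetical expression profiles

# Condensed vector of pairwise distances between profiles;
# metric="correlation" would give the 1 - r distance discussed above.
d = pdist(profiles, metric="euclidean")

# Complete linkage defines the distance between two nodes as the maximum
# of all pairwise distances between their members; method="single" uses
# the minimum instead. At each step the closest pair of nodes is merged.
Z = linkage(d, method="complete")

# Cut the dendrogram at a fixed height; each resulting subtree is a cluster.
labels = fcluster(Z, t=5.0, criterion="distance")
print(labels)   # cluster label for each of the five profiles
```

Passing Z to scipy.cluster.hierarchy.dendrogram would draw the tree, with branch heights showing the distances at which nodes were merged.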
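For k-means, here is a minimal scikit-learn sketch with k = 3; the three sets of simulated coordinates are hypothetical stand-ins for the adjoining areas of Fig. 4a.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Points scattered around three adjoining areas (hypothetical coordinates).
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(2.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(1.0, 2.0), scale=0.5, size=(50, 2)),
])

# k must be chosen in advance; the algorithm then alternates between
# assigning each point to its nearest centroid and recomputing centroids.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster assignments of the first 10 points
```

Because the result depends on the random initialization of the centroids, n_init runs the algorithm several times and keeps the best solution.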