Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R

SUMMARY Hierarchical clustering is a widely used method for detecting clusters in genomic data. Clusters are defined by cutting branches off the dendrogram. A common but inflexible method uses a constant height cutoff value; this method performs suboptimally on complicated dendrograms. We present the Dynamic Tree Cut R package, which implements novel dynamic branch cutting methods that detect clusters in a dendrogram based on their shape. Compared to the constant height cutoff method, our techniques offer the following advantages: (1) they are capable of identifying nested clusters; (2) they are flexible: cluster shape parameters can be tuned to suit the application at hand; (3) they are suitable for automation; and (4) they can optionally combine the advantages of hierarchical clustering and partitioning around medoids, giving better detection of outliers. We illustrate the use of these methods by applying them to protein-protein interaction network data and to a simulated gene expression data set.

AVAILABILITY The Dynamic Tree Cut method is implemented in an R package available at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/BranchCutting.
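The contrast between the two cutting strategies can be sketched in R as follows. This is a minimal illustration, not the analysis reported in the paper: the simulated data, the average linkage, the cut height, and the `minClusterSize` and `deepSplit` values are all assumptions chosen for the example. The baseline uses base R (`hclust`, `cutree`); the dynamic cuts assume the package is installed under the name `dynamicTreeCut` and are skipped otherwise.

```r
# Illustrative data (an assumption, not from the paper):
# two tight groups and one looser group, 20 points each.
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 1.0), ncol = 2)
)

d    <- dist(x)                       # pairwise Euclidean distances
tree <- hclust(d, method = "average") # hierarchical cluster tree

# Baseline: constant height cutoff. Every branch is cut at the same
# height, here arbitrarily set to half the height of the tree.
static_labels <- cutree(tree, h = 0.5 * max(tree$height))
print(table(static_labels))

# Dynamic branch cutting, only if the package is available.
if (requireNamespace("dynamicTreeCut", quietly = TRUE)) {
  # "tree" variant: uses the dendrogram alone.
  dyn_tree <- dynamicTreeCut::cutreeDynamic(
    tree, method = "tree", minClusterSize = 5
  )

  # "hybrid" variant: also uses the distance matrix and, with
  # pamStage = TRUE, a PAM-like step to place remaining objects.
  dyn_hybrid <- dynamicTreeCut::cutreeDynamic(
    tree,
    distM          = as.matrix(d),
    method         = "hybrid",
    minClusterSize = 5,
    deepSplit      = 2,
    pamStage       = TRUE
  )

  # Label 0 marks objects left unassigned.
  print(table(dyn_tree))
  print(table(dyn_hybrid))
}
```

Comparing the three label tables shows how the choice of cutting method changes the resulting partition; the tuning values above would need to be adapted to the application at hand.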
