A Greedy Algorithm for Hierarchical Complete Linkage Clustering

We are interested in the greedy method for computing a hierarchical complete linkage clustering. Two methods are known for this problem: one runs in \({\mathcal O}(n^3)\) time with \({\mathcal O}(n)\) space, and the other runs in \({\mathcal O}(n^2 \log n)\) time with \(\Theta(n^2)\) space, where \(n\) is the number of points to be clustered. Neither method can handle large point sets. In this paper, we give an algorithm with a space requirement of \({\mathcal O}(n)\) that is able to cluster one million points in a day on current commodity hardware.
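
For orientation, the greedy method repeatedly merges the two clusters whose complete linkage distance (the largest distance between a point of one and a point of the other) is smallest. The following is a minimal sketch of the naive variant that recomputes distances on the fly, which is one way to arrive at cubic time with linear space; it is not the algorithm proposed in the paper. The function name `greedy_complete_linkage`, the Euclidean metric, and the tie-breaking rule are assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def greedy_complete_linkage(points):
    """Naive greedy complete linkage (illustrative sketch, not the paper's method).

    Returns the merges in order as (kept_id, absorbed_id, distance) tuples.
    Distances are recomputed at every step rather than cached, so the extra
    space stays linear in the number of points while the time is cubic.
    """
    # Each cluster is a list of point indices; all lists together hold n indices.
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Complete linkage distance: the largest pairwise point distance.
            d = max(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            # Strict "<" keeps the first pair found on ties (an assumption).
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge b into a
        merges.append((a, b, d))
    return merges


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (5, 0), (7, 0)]
    for kept, absorbed, d in greedy_complete_linkage(pts):
        print(f"merge cluster {absorbed} into {kept} at distance {d}")
```

Each merge step evaluates at most about \(n^2/2\) point distances, and there are \(n-1\) steps, which gives the cubic bound for this sketch; caching the cluster distances instead would trade the linear space for a quadratic table.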
