fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data

Motivation: fast_protein_cluster is a fast, parallel and memory-efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project.

Results: fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations on a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and that identify significantly more accurate models than Spicker and Clusco.

Availability and implementation: fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) intrinsics code accelerates CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the MIT license (http://software.compbio.washington.edu/fast_protein_cluster).

Contact: lhhung@compbio.washington.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
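To make the two metrics named above concrete, here is a minimal Python sketch of how RMSD and a TM-score term can be evaluated for two models whose residues are already paired and already superposed. It is an illustration only, not the package's C++/SIMD/OpenCL code: it does not perform the optimal superposition or the search over superpositions that the real metrics require. The function names (`rmsd_fixed`, `tm_d0`, `tm_score_fixed`) are hypothetical, and the 0.5 Å lower bound on d0 for short chains is an assumption made here to keep the formula defined.

```python
import math

def rmsd_fixed(a, b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the points are already paired and superposed; no fitting is done.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    sq_sum = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(sq_sum / len(a))

def tm_d0(target_len):
    """Length-dependent distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.

    The 0.5 floor for short chains is an assumption of this sketch.
    """
    if target_len <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_len - 15) ** (1.0 / 3.0) - 1.8)

def tm_score_fixed(a, b, target_len=None):
    """TM-score sum for one fixed superposition (no maximization over fits).

    target_len is the normalizing length; it defaults to the number of pairs.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    n = target_len if target_len is not None else len(a)
    d0 = tm_d0(n)
    total = 0.0
    for pa, pb in zip(a, b):
        d2 = sum((p - q) ** 2 for p, q in zip(pa, pb))
        total += 1.0 / (1.0 + d2 / (d0 * d0))
    return total / n

# Toy usage: a 30-point chain compared with a copy shifted 3 units along x.
model = [(float(i), 0.0, 0.0) for i in range(30)]
shifted = [(x + 3.0, y, z) for x, y, z in model]
print(rmsd_fixed(model, shifted))
print(tm_score_fixed(model, shifted))
```

Because every pair in the toy example is displaced by the same 3 units, the RMSD equals that displacement, while the TM-score term falls below 1 by an amount that depends on d0 for the chosen length. A full implementation would first find the rotation and translation that minimize RMSD (or maximize TM-score) before evaluating these sums.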
