Identify High-Quality Protein Structural Models by Enhanced K-Means

Background. A critical step in protein three-dimensional structure prediction, whether by ab initio or comparative modeling, is identifying high-quality structural models among the generated decoys. Clustering algorithms are widely used to identify near-native models; however, their performance varies with the set of conformational decoys, and for some algorithms accuracy declines as the decoy population grows. Results. We propose two enhanced K-means clustering algorithms that robustly identify high-quality protein structural models. The first uses the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the second uses squared distances to choose the initial centroids (K-means++). SK-means and K-means++ were more robust than SPICKER alone, identifying models with template modeling (TM) scores better than or equal to those of SPICKER for 33 (59%) and 42 (75%) of 56 targets, respectively. Conclusions. The classic K-means algorithm performed similarly to SPICKER, a widely used algorithm for protein structure identification, whereas both SK-means and K-means++ showed substantial improvements over SPICKER and classic K-means.
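The squared-distance seeding mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard k-means++ rule, in which each new initial centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, and it assumes the decoys are described only by a precomputed pairwise distance matrix (for example, RMSD values). The function names `kmeanspp_seeds` and `assign` and the toy data are hypothetical and are not taken from the paper.

```python
import random

def kmeanspp_seeds(dist, k, rng=None):
    """Pick up to k seed indices from a symmetric pairwise-distance matrix.

    Assumption: dist[i][j] is a precomputed distance between decoys i and j
    (e.g. RMSD); the paper's actual distance measure may differ.
    """
    rng = rng or random.Random(0)
    n = len(dist)
    seeds = [rng.randrange(n)]  # first seed: uniform at random
    # Squared distance from every decoy to its nearest seed so far.
    d2 = [dist[i][seeds[0]] ** 2 for i in range(n)]
    while len(seeds) < k:
        total = sum(d2)
        if total == 0:  # every decoy coincides with an existing seed
            break
        # Sample an index with probability proportional to d2.
        r = rng.random() * total
        acc, chosen = 0.0, None
        for i, w in enumerate(d2):
            acc += w
            if w > 0 and acc >= r:
                chosen = i
                break
        if chosen is None:  # guard against floating-point round-off
            chosen = max(range(n), key=d2.__getitem__)
        seeds.append(chosen)
        d2 = [min(d2[i], dist[i][chosen] ** 2) for i in range(n)]
    return seeds

def assign(dist, seeds):
    """Label each decoy with the index of its nearest seed."""
    return [min(seeds, key=lambda s: dist[i][s]) for i in range(len(dist))]

# Toy stand-in for decoys: 1-D points, absolute difference as the distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
seeds = kmeanspp_seeds(dist, 3)
labels = assign(dist, seeds)
print(seeds, labels)
```

In a full pipeline the seeds would only initialize the iterative K-means updates. SK-means, as described above, would instead take its initial centroids from SPICKER output, which this sketch does not attempt to reproduce.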
