Paper information - Identify High-Quality Protein Structural Models by Enhanced K-Means

Background. A critical issue in protein three-dimensional structure prediction, whether by ab initio or comparative modeling, is identifying high-quality structural models among the generated decoys. Clustering algorithms are widely used to identify near-native models; however, their performance depends on the particular set of conformational decoys, and for some algorithms the accuracy declines as the decoy population grows.

Results. Here, we propose two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first uses the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the second uses squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust than SPICKER alone: for 33 (59%) and 42 (75%) of 56 targets, respectively, they identified models with template modeling scores (TM-scores) better than or equal to those of SPICKER.

Conclusions. The classic K-means algorithm performed similarly to SPICKER, which is a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements relative to SPICKER and classical K-means.
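The two variants described in the abstract differ from basic K-means only in how the initial centroids are chosen. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that each decoy has already been reduced to a fixed-length coordinate tuple and that plain Euclidean distance is an acceptable stand-in for the structural distance used in the paper. The function names `squared_distance`, `kmeans_pp_seeds`, and `kmeans` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples.

    Assumption: a real decoy comparison would likely use a structural
    distance instead; this is only a toy stand-in.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centroids by squared-distance-weighted sampling."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = rng or random.Random()
    # The first seed is drawn uniformly at random.
    seeds = [points[rng.randrange(len(points))]]
    # d2[i] holds the squared distance from points[i] to its nearest seed.
    d2 = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0:
            # Every point coincides with a seed; fall back to uniform choice.
            idx = rng.randrange(len(points))
        else:
            # Points far from all current seeds are more likely to be chosen.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        seeds.append(points[idx])
        d2 = [min(old, squared_distance(p, points[idx]))
              for old, p in zip(d2, points)]
    return seeds


def kmeans(points, seeds, iterations=20):
    """Basic K-means (Lloyd iterations) starting from the given seeds."""
    centroids = list(seeds)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members,
        # keeping the old centroid if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters
```

In this sketch, a K-means++ style run would call `kmeans(points, kmeans_pp_seeds(points, k))`. An SK-means style run would instead pass centroids obtained from an external clustering tool as `seeds`; how those centroids are exported from SPICKER is not covered here.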
