Reducing Ensembles of Protein Tertiary Structures Generated De Novo via Clustering

Controlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, on the premise that a larger ensemble is more likely to contain some structures near the sought biologically active structure. A major drawback of this approach is that computing a large number of structures imposes time and space costs. In this paper, we propose a novel clustering-based approach that we show significantly reduces the size of an ensemble of generated structures without sacrificing quality. Evaluations are reported on both benchmark and CASP target proteins. The structure ensembles subjected to the proposed approach and its source code are publicly available at the links provided in Section 1.
