Protein Structure Prediction and Clustering Using MUFOLD

The 3D structure of a protein holds the key to understanding its biological function at the molecular level. Knowledge of protein structure also allows researchers to identify and characterize disease targets and provides a rational approach to drug design. In contrast to high-throughput DNA sequencing, structure determination is relatively low-throughput because experimental methods such as X-ray crystallography and NMR are costly, time-consuming, and often technically difficult. Consequently, the gap between the number of known protein sequences and the number of experimentally solved structures has been widening significantly. Computational prediction methods have become an important alternative to experimental structure determination, as they can often generate useful structural models quickly and at little cost. Protein structure prediction has improved steadily over the past two decades. However, current methods are still far from consistently predicting accurate structural models with the computing power accessible to common users. To address this challenge, we developed MUFOLD, a hybrid method that uses whole and partial template information along with new computational techniques for protein tertiary structure prediction. MUFOLD covers both template-based and ab initio prediction within the same framework and aims for high accuracy and fast computation. In MUFOLD, the prediction problem is formulated as a graph-realization problem, and multi-dimensional scaling (MDS), an efficient global optimization approach, is employed to solve it. Because this framework generates models efficiently from distance matrices predicted from template fragment matches, deeper and broader information from the PDB can be utilized, and model quality can be improved through evaluation and refinement. MUFOLD speeds up the prediction dramatically over conventional methods.
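The graph-realization step can be illustrated with classical MDS, which recovers 3D coordinates from a complete, noise-free matrix of pairwise distances. The following is a minimal sketch only, not the MUFOLD implementation: the function names are hypothetical, and the MDS variant MUFOLD applies to predicted (noisy, possibly incomplete) distance matrices may well differ from the textbook version shown here.

```python
import numpy as np


def pairwise_distances(coords):
    """Return the matrix of Euclidean distances between all rows of coords."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(dist, dim=3):
    """Recover coordinates from a distance matrix (hypothetical helper).

    The result is defined only up to rotation, reflection, and translation.
    """
    n = dist.shape[0]
    # Double-center the squared distances to obtain the Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the `dim` largest eigenpairs; clip small negative eigenvalues
    # that can appear through round-off or inconsistent distances.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


if __name__ == "__main__":
    # Toy data standing in for C-alpha coordinates (an assumption).
    rng = np.random.default_rng(0)
    coords = rng.normal(size=(20, 3))
    rebuilt = classical_mds(pairwise_distances(coords))
    error = np.abs(pairwise_distances(rebuilt) - pairwise_distances(coords)).max()
    print("largest distance error:", error)
```

With exact distances the rebuilt coordinates reproduce the input distance matrix to within round-off; with predicted distances, the clipped eigenvalues give a least-squares style approximation instead.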
In addition, MUFOLD improves its predictions consistently by iteratively using information from the generated models. MUFOLD has demonstrated its effectiveness in CASP experiments and various applications. Current protein structure prediction methods, including MUFOLD, often generate a large population of candidate models and then select near-native models through clustering. Existing structural model clustering methods are time-consuming because they calculate pairwise distances between models. To address this issue, we developed a novel method for fast model clustering without loss of clustering accuracy. Instead of the commonly used pairwise RMSD and TM-score values, we propose two new measures, Dscore1 and Dscore2, based on the comparison of protein distance matrices, which describe the difference and the similarity among models, respectively. Our analysis indicates that Dscore1 correlates highly with RMSD and that Dscore2 correlates highly with TM-score. Our Dscore1-based clustering achieves a calculation time that is linear in the number of models while obtaining almost the same accuracy for near-native model selection as existing methods, whose calculation time is quadratic in the number of models. By using Dscore2 to select cluster representatives, we can further improve the quality of the representatives with little increase in computing time. Our method has been implemented in a package named MUFOLD-CL, available at http://mufold.org/clustering.php. This work has been supported by National Institutes of Health Grant R21/R33-GM078601. Major computer time was provided by the University of Missouri Bioinformatics Consortium.
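The idea of comparing models through their distance matrices, and of avoiding all-against-all comparison, can be sketched as follows. This is an illustration under stated assumptions, not MUFOLD-CL itself: the definitions of Dscore1 and Dscore2 are not given here, so `matrix_rms_difference` is a hypothetical stand-in dissimilarity, and scoring each model against the mean distance matrix is only one assumed way to obtain a cost linear in the number of models.

```python
import numpy as np


def ca_distance_matrix(coords):
    """Intra-model distance matrix; invariant to rotation and translation."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def matrix_rms_difference(dm_a, dm_b):
    """Hypothetical dissimilarity between two distance matrices.

    This is NOT the published Dscore1; it only illustrates that two models
    can be compared without a structural superposition.
    """
    upper = np.triu_indices(dm_a.shape[0], k=1)
    return float(np.sqrt(np.mean((dm_a[upper] - dm_b[upper]) ** 2)))


def rank_by_centroid(models):
    """Rank models by dissimilarity to the mean distance matrix.

    Each of the N models is compared once with the centroid, so the number
    of comparisons grows linearly with N rather than quadratically.
    """
    matrices = [ca_distance_matrix(m) for m in models]
    centroid = np.mean(matrices, axis=0)
    scores = [matrix_rms_difference(dm, centroid) for dm in matrices]
    return np.argsort(scores), scores


if __name__ == "__main__":
    # Toy decoy set (an assumption): five close models and one outlier.
    rng = np.random.default_rng(1)
    base = rng.normal(size=(30, 3)) * 10.0
    models = [base + rng.normal(scale=0.1, size=base.shape) for _ in range(5)]
    models.append(base + rng.normal(scale=5.0, size=base.shape))
    order, scores = rank_by_centroid(models)
    print("models ordered from most to least central:", order)
```

Because distance matrices do not change under rotation or translation, no superposition is needed before models are compared, which is part of what makes such measures inexpensive to evaluate.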