ProFitFun: a protein tertiary structure fitness function for quantifying the accuracies of model structures

MOTIVATION Accurate estimation of the quality of protein model structures is a cornerstone of protein structure prediction. Despite the recent groundbreaking success in the field, there is still room to improve model quality estimation at multiple stages of protein structure prediction and thereby to push prediction accuracy further. Here, a novel approach for assessing the quality of protein models, named ProFitFun, is proposed. It harnesses the sequence and structural features of experimental protein structures, expressed as the preferences of backbone dihedral angles and relative surface accessibility of their amino acid residues at the tripeptide level. For each residue, these preferences account for its N-terminal and C-terminal neighbors in the protein structure. The preferences are used to evaluate protein structures through a machine learning approach, which was tested on an extensive dataset of diverse proteins.

RESULTS The approach was extensively validated on a large test dataset (n = 25,005) of protein structures, comprising 23,661 models of 82 non-homologous proteins and 1,344 non-homologous experimental structures. An external dataset of 40,000 models of 200 non-homologous proteins was also used for validation. Both datasets were further used to benchmark the proposed method against four state-of-the-art methods for protein structure quality assessment. In the benchmarking, the proposed method outperformed some of the state-of-the-art methods in terms of Spearman's and Pearson's correlation coefficients, average GDT-TS loss, sum of z-scores, and average absolute difference between predicted and corresponding observed values. The high accuracy of the proposed approach suggests a potential use of these sequence and structural features in computational protein design.

AVAILABILITY http://github.com/KYZ-LSB/ProTerS-FitFun.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
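The benchmarking criteria named above can be illustrated with a minimal sketch for a single target. This is not the ProFitFun implementation: the function names and the toy scores are assumptions made for illustration, GDT-TS loss is taken in its usual sense (observed score of the best available model minus the observed score of the model ranked first by the predictor), and the sum of z-scores is omitted because it needs scores from several methods.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def gdt_ts_loss(predicted, observed):
    """Observed score of the best model minus that of the top-predicted model."""
    top = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(observed) - observed[top]


def mean_abs_diff(predicted, observed):
    """Average absolute difference between predicted and observed scores."""
    return mean(abs(p - o) for p, o in zip(predicted, observed))


# Hypothetical scores for four models of one target (not data from the paper).
observed = [0.82, 0.45, 0.67, 0.30]   # e.g. GDT-TS on a 0-1 scale
predicted = [0.70, 0.50, 0.75, 0.35]  # scores from a quality-assessment method

print(pearson(predicted, observed))
print(spearman(predicted, observed))
print(gdt_ts_loss(predicted, observed))
print(mean_abs_diff(predicted, observed))
```

In a benchmark over many targets, the per-target values would typically be averaged, which is how an "average GDT-TS loss" is usually obtained.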

[1]  Charlotte M Deane,et al.  RFQAmodel: Random Forest Quality Assessment to identify a predicted protein structure in the correct fold , 2019, PloS one.

[2]  Jiaxiang Wu,et al.  When homologous sequences meet structural decoys: Accurate contact prediction by tFold in CASP14—(tFold for CASP14 contact prediction) , 2021, Proteins.

[3]  Dmitrij Frishman,et al.  STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins , 2004, Nucleic Acids Res..

[4]  Dong Xu,et al.  Toward optimal fragment generations for ab initio protein structure assembly , 2013, Proteins.

[5]  Liam J. McGuffin,et al.  Rapid model quality assessment for protein structure predictions using the comparison of multiple models without structural alignments , 2010, Bioinform..

[6]  Qiwen Dong,et al.  MQAPRank: improved global protein model quality assessment by learning-to-rank , 2017, BMC Bioinformatics.

[7]  B. K.C.Dukka,et al.  Recent advances in sequence-based protein structure prediction , 2017, Briefings Bioinform..

[8]  Adam Zemla,et al.  LGA: a method for finding 3D similarities in protein structures , 2003, Nucleic Acids Res..

[9]  Kliment Olechnovič,et al.  VoroMQA: Assessment of protein structure quality using interatomic contact areas , 2017, Proteins.

[10]  Torsten Schwede,et al.  Protein modeling: what happened to the "protein structure gap"? , 2013, Structure.

[11]  Karolis Uziela,et al.  ProQ2: estimation of model accuracy implemented in Rosetta , 2016, Bioinform..

[12]  Yang Zhang,et al.  How significant is a protein structure similarity with TM-score = 0.5? , 2010, Bioinform..

[13]  Yoshua Bengio,et al.  Deep convolutional networks for quality assessment of protein folds , 2018, Bioinform..

[14]  Jie Hou,et al.  DeepQA: improving the estimation of single protein model quality with deep belief networks , 2016, BMC Bioinformatics.

[15]  L. McGuffin,et al.  ModFOLD8: accurate global and local quality estimates for 3D protein models , 2021, Nucleic Acids Res..

[16]  Kam Y. J. Zhang,et al.  Error-estimation-guided rebuilding of de novo models increases the success rate of ab initio phasing. , 2012, Acta crystallographica. Section D, Biological crystallography.

[17]  Jianyi Yang,et al.  Improved protein structure prediction using predicted interresidue orientations , 2020, Proceedings of the National Academy of Sciences.

[18]  Yang Zhang,et al.  Scoring function for automated assessment of protein structure template quality , 2004, Proteins.

[19]  Kam Y. J. Zhang,et al.  Efficient Sampling in Fragment-Based Protein Structure Prediction Using an Estimation of Distribution Algorithm , 2013, PloS one.

[20]  Sergei Grudinin,et al.  VoroCNN: deep convolutional neural network built on 3D Voronoi tessellation of protein structures , 2021, Bioinform..

[21]  Asheesh Shanker,et al.  ProTSAV: A protein tertiary structure analysis and validation server. , 2016, Biochimica et biophysica acta.

[22]  Torsten Schwede,et al.  Critical assessment of methods of protein structure prediction (CASP)—Round XIII , 2019, Proteins.

[23]  Yong Zhou,et al.  Entropy-accelerated exact clustering of protein decoys , 2011, Bioinform..

[24]  Rahul Kaushik,et al.  Where Informatics Lags Chemistry Leads. , 2017, Biochemistry.

[25]  Arne Elofsson,et al.  ProQ3D: improved model quality assessments using deep learning , 2016, Bioinform..

[26]  Renzhi Cao,et al.  Protein single-model quality assessment by feature-based probability density functions , 2016, Scientific Reports.

[27]  Anna Tramontano,et al.  Critical assessment of methods of protein structure prediction (CASP) — round x , 2014, Proteins.

[28]  Sergei Grudinin,et al.  Smooth orientation-dependent scoring function for coarse-grained protein quality assessment , 2018, Bioinform..

[29]  Rahul Kaushik,et al.  From Ramachandran Maps to Tertiary Structures of Proteins. , 2015, The journal of physical chemistry. B.

[30]  Miao Sun,et al.  QAcon: single model quality assessment using protein structural and contact information with machine learning techniques , 2016, Bioinform..

[31]  Arne Elofsson,et al.  GraphQA: protein model quality assessment using graph convolutional networks , 2020, Bioinform..

[32]  P3CMQA: Single-Model Quality Assessment Using 3DCNN with Profile-Based Features , 2021, Bioengineering.

[33]  Nikos E. Mastorakis,et al.  Multilayer perceptron and neural networks , 2009 .

[34]  Jian Peng,et al.  Low-homology protein threading , 2010, Bioinform..

[35]  Daniel B. Roche,et al.  Toolbox for Protein Structure Prediction. , 2016, Methods in molecular biology.

[36]  Yang Zhang,et al.  3DRobot: automated generation of diverse and well-packed protein structure decoys , 2016, Bioinform..

[37]  Mohammed AlQuraishi,et al.  AlphaFold at CASP13 , 2019, Bioinform..

[38]  Steven E. Brenner,et al.  SCOPe: classification of large macromolecular structures in the structural classification of proteins—extended database , 2018, Nucleic Acids Res..

[39]  Sergei Grudinin,et al.  Protein model quality assessment using 3D oriented convolutional neural networks , 2018, bioRxiv.

[40]  Minkyung Baek,et al.  Assessment of protein model structure accuracy estimation in CASP13: Challenges in the era of deep learning , 2019, Proteins.

[41]  A. Tramontano,et al.  Critical assessment of methods of protein structure prediction (CASP)—Round XII , 2018, Proteins.

[42]  Jooyoung Lee,et al.  SVMQA: support‐vector‐machine‐based protein single‐model quality assessment , 2017, Bioinform..

[43]  Kam Y. J. Zhang,et al.  A protein sequence fitness function for identifying natural and nonnatural proteins , 2020, Proteins: Structure, Function, and Bioinformatics.

[44]  Improved protein structure refinement guided by deep learning based accuracy estimation , 2021, Nature communications.

[45]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..