IDP-Seq2Seq: identification of intrinsically disordered regions based on sequence to sequence learning

MOTIVATION Intrinsically disordered regions (IDRs) are widely distributed in proteins and are related to many important biological functions. Accurate prediction of IDRs is critical for protein structure and function analysis. However, existing computational methods construct their predictive models solely in the sequence space and fail to convert the sequence space into a "semantic space" that reflects the structural characteristics of proteins. Furthermore, although length-dependent predictors have shown promising results, new fusion strategies should be explored to improve their predictive performance and generalization.
RESULTS In this study, we applied Sequence to Sequence Learning (Seq2Seq), derived from natural language processing (NLP), to map protein sequences into a "semantic space" that reflects structural patterns, with the help of predicted Residue-Residue Contacts (CCMs) and other sequence-based features. Furthermore, the Attention mechanism was employed to capture the global associations between all residue pairs in a protein. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction, and IDP-Seq2Seq-G for both long and short disordered region prediction. Finally, these three predictors were fused into one predictor, called IDP-Seq2Seq, to improve discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive to the ratio of long to short disordered regions and outperforms other competing methods.
AVAILABILITY For the convenience of experimental scientists, a user-friendly and publicly accessible web server for the new predictor has been established at http://bliulab.net/IDP-Seq2Seq/. It is anticipated that IDP-Seq2Seq will become a very useful tool for the identification of intrinsically disordered regions.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
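
As a rough illustration of how an attention mechanism can relate every residue to every other residue, the following minimal Python sketch computes scaled dot-product self-attention over a toy protein. It is not the IDP-Seq2Seq implementation: the function names, the identity query/key/value projections, and the four-residue feature vectors are all assumptions made for illustration. A trained model would learn its projections and would operate on the features named in RESULTS.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    features: one feature vector per residue (list of equal-length lists).
    Returns (weights, context), where weights[i][j] is how strongly
    residue i attends to residue j, and context[i] is the attention-weighted
    mix of all residue vectors as seen from residue i.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    weights = []
    context = []
    for query in features:
        # Similarity of this residue to every residue in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in features
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all residue vectors (values == features here).
        context.append([
            sum(w * value[d] for w, value in zip(row, features))
            for d in range(dim)
        ])
    return weights, context


# Hypothetical toy input: 4 residues, 3 made-up features per residue.
toy_features = [
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.4],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.3],
]
weights, context = self_attention(toy_features)
```

Row `i` of `weights` sums to 1 and gives the attention that residue `i` pays to each residue in the sequence, so the matrix covers all residue pairs regardless of how far apart they are along the chain.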
