Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings

Abstract

Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, its application to estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best ratio of performance to computational cost for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate by identifying conserved phosphorylation motifs in variable insert segments of protein kinases. Overall, embedding-based conservation analysis is a broadly applicable, alignment-free method for estimating conservation and identifying potential functional sites in any full-length protein sequence. To run it on a protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
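As a rough illustration of the idea in the abstract, the sketch below maps per-residue embedding vectors to per-residue conservation scores and then lists the highest-scoring positions. It is a minimal sketch under stated assumptions, not the implementation in the linked repository: the linear readout, the clipping to [0, 1], the function names `score_residues` and `top_sites`, and all toy numbers are hypothetical, and real embeddings would come from a protein language model such as ESM2 rather than being typed in by hand.

```python
from typing import List, Sequence, Tuple


def score_residues(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> List[float]:
    """Apply a linear readout to each residue embedding (an assumed model form).

    Each residue is scored independently from its own embedding vector, so no
    multiple sequence alignment is needed. Scores are clipped to [0, 1].
    """
    scores = []
    for vec in embeddings:
        if len(vec) != len(weights):
            raise ValueError("embedding width does not match weight vector")
        raw = sum(x * w for x, w in zip(vec, weights)) + bias
        scores.append(min(1.0, max(0.0, raw)))
    return scores


def top_sites(
    sequence: str, scores: Sequence[float], k: int = 3
) -> List[Tuple[int, str, float]]:
    """Return the k highest-scoring positions as (1-based index, residue, score),
    listed in sequence order."""
    if len(sequence) != len(scores):
        raise ValueError("need exactly one score per residue")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i + 1, sequence[i], scores[i]) for i in sorted(ranked)]


# Toy inputs: a 4-residue "protein" with hand-written 3-dimensional embeddings.
# Real per-residue embeddings have hundreds to thousands of dimensions.
toy_sequence = "MKDG"
toy_embeddings = [
    [0.1, 0.0, 0.2],
    [0.9, 0.5, 0.1],
    [0.2, 0.1, 0.0],
    [0.8, 0.9, 0.3],
]
toy_weights = [0.5, 0.5, 0.0]  # hypothetical; a real readout would be fitted

toy_scores = score_residues(toy_embeddings, toy_weights)
candidate_sites = top_sites(toy_sequence, toy_scores, k=2)
```

Because every position is scored from the embedding of the full-length sequence, a multidomain protein can be passed in whole; candidate functional sites are then simply the positions returned by `top_sites`.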
