Clustering FunFams using sequence embeddings improves EC purity

Motivation: Classifying proteins into functional families can improve our understanding of a protein's function and allows annotations to be transferred within the same family. To this end, functional families need to be "pure", i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function, based on differentially conserved residues. 11% of all FunFams (22,830 of 203,639) contain EC annotations, and of those, 7% (1,526 of 22,830) have at least two different EC annotations, i.e., inconsistent functional annotations.

Results: We propose an approach to further cluster FunFams into smaller and functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from deep learned language models (LMs) that transfer the knowledge gained from predicting missing amino acids in a sequence (ProtBERT), and they have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between sequences in embedding space and DBSCAN to cluster FunFams and to identify outlier sequences yielded twice as many pure clusters per FunFam as a random clustering. 52% of the impure FunFams were split into pure clusters, four times more than with random clustering. While functional consistency was mainly measured using EC annotations, we observed similar results for binding annotations. Thus, we also expect increased purity for other definitions of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency can be used to infer annotations more reliably. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.

Availability: The source code and PB-Tucker embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering
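
The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository); it only shows the general idea of running DBSCAN on per-protein embedding vectors and then checking whether each resulting cluster carries a single EC number. The function names `cluster_funfam` and `pure_cluster_fraction`, the toy 8-dimensional vectors, the EC labels, and the values of `eps` and `min_samples` are all assumptions chosen for illustration, and Euclidean distance is assumed as the metric.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_funfam(embeddings, eps, min_samples=2):
    """Cluster the embeddings of one family with DBSCAN.

    Returns one label per sequence; -1 marks sequences DBSCAN treats as
    outliers. eps and min_samples are illustrative, not published values.
    """
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean")
    return model.fit_predict(embeddings)


def pure_cluster_fraction(labels, ec_numbers):
    """Fraction of clusters (outliers excluded) with exactly one EC number."""
    ecs_per_cluster = {}
    for label, ec in zip(labels, ec_numbers):
        if label == -1:
            continue  # outliers are not assigned to any cluster
        ecs_per_cluster.setdefault(label, set()).add(ec)
    if not ecs_per_cluster:
        return 0.0
    pure = sum(1 for ecs in ecs_per_cluster.values() if len(ecs) == 1)
    return pure / len(ecs_per_cluster)


# Toy "impure" family: two tight groups with different EC numbers
# plus one distant sequence. Real embeddings would come from a model.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.05, size=(5, 8))
group_b = rng.normal(loc=1.0, scale=0.05, size=(5, 8))
distant = np.full((1, 8), 5.0)
embeddings = np.vstack([group_a, group_b, distant])
ec_numbers = ["1.1.1.1"] * 5 + ["2.7.11.1"] * 5 + ["3.4.21.4"]

labels = cluster_funfam(embeddings, eps=1.0, min_samples=2)
fraction = pure_cluster_fraction(labels, ec_numbers)
print(labels, fraction)
```

In this toy case the family as a whole has three EC numbers, while each DBSCAN cluster has one and the distant sequence is flagged as an outlier. On real data, `eps` would need to be chosen per family or per dataset, since a single fixed radius is unlikely to suit all FunFams.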
