Machine learning models for predicting protein condensate formation from sequence determinants and embeddings

Intracellular phase separation of proteins into biomolecular condensates is increasingly recognised as an important phenomenon for cellular compartmentalisation and regulation of biological function. Several hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, and some of them have been probed experimentally using constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding, on a global level, the associations between protein sequence and condensate formation, and used this information to construct machine learning classifiers for predicting liquid–liquid phase separation (LLPS) from protein sequence. Our analysis highlighted that LLPS-prone sequences are more disordered, hydrophobic and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database, and that their disordered regions are enriched in polar, aromatic and charged residues. Using these features together with neural-network-based word2vec sequence embeddings, we developed machine learning classifiers for predicting protein condensate formation. Our model, trained to distinguish LLPS-prone sequences from structured proteins, achieved high accuracy (93%; 25-fold cross-validation) and identified condensate-forming sequences in external independent test data at 97% sensitivity. Moreover, when this model was combined with a second classifier trained to distinguish between sequences of varying LLPS propensity, the high sensitivity was supplemented with high specificity (approximated ROC-AUC of 0.85). These results provide a platform rooted in molecular principles for understanding protein phase behaviour. The predictor is accessible from https://deephase.ch.cam.ac.uk/.
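To make the explicit sequence features named above more concrete, the following is a minimal sketch of how per-sequence Shannon entropy, mean hydropathy and residue-class fractions could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the use of the Kyte-Doolittle scale as the hydrophobicity measure, and the residue groupings (aromatic F/W/Y, charged D/E/K/R, polar S/T/N/Q) are choices made here. Disorder is not computed, since it would require a separate predictor.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy scale (assumed here as the hydrophobicity measure).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residue groupings are illustrative assumptions, not taken from the paper.
AROMATIC = frozenset("FWY")
CHARGED = frozenset("DEKR")
POLAR = frozenset("STNQ")


def shannon_entropy(seq):
    """Shannon entropy (bits) of the amino acid composition of seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over residues found in the scale."""
    values = [KYTE_DOOLITTLE[r] for r in seq if r in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("no standard residues in sequence")
    return sum(values) / len(values)


def residue_fraction(seq, residues):
    """Fraction of positions in seq occupied by any of the given residues."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(1 for r in seq if r in residues) / len(seq)


def sequence_features(seq):
    """Collect the hand-crafted features for one sequence into a dict."""
    seq = seq.upper()
    return {
        "entropy": shannon_entropy(seq),
        "hydropathy": mean_hydropathy(seq),
        "aromatic": residue_fraction(seq, AROMATIC),
        "charged": residue_fraction(seq, CHARGED),
        "polar": residue_fraction(seq, POLAR),
    }
```

Calling `sequence_features` on a protein sequence string yields a small feature dictionary that could be concatenated with an embedding vector before classification. A low-complexity sequence dominated by a few residue types gives a lower entropy value than a compositionally diverse one.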
Significance Statement

The tendency of many cellular proteins to form protein-rich biomolecular condensates underlies the formation of subcellular compartments and has been linked to various physiological functions. Understanding the molecular basis of this fundamental process and predicting protein phase behaviour have therefore become important objectives. To develop a global understanding of how protein sequence determines phase behaviour, we constructed bespoke datasets of proteins of varying phase separation propensity and identified explicit biophysical and sequence-specific features common to phase-separating proteins. Moreover, by combining this insight with neural-network-based sequence embeddings, we trained machine learning classifiers that identified phase-separating sequences with high accuracy, including in independent external test data. The predictor is available from https://deephase.ch.cam.ac.uk/.
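Word2vec-style embeddings operate on "words", so a protein sequence first has to be split into tokens. The sketch below shows one common way of doing this, using overlapping k-mers. The function name `kmer_tokens` and the default k = 3 are assumptions made for illustration; the text above does not state which tokenisation was used.

```python
def kmer_tokens(seq, k=3):
    """Split a protein sequence into overlapping k-mers ("words").

    Hypothetical helper: the choice of overlapping windows and k=3
    is an assumption, not a detail taken from the paper.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = seq.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

The resulting token lists could then be passed as "sentences" to a word2vec implementation, and the learned token vectors pooled (for example, averaged) into one fixed-length vector per protein.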
