Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning

Self-supervised representation learning of biological sequence embeddings alleviates computational resource constraints on downstream tasks while circumventing expensive experimental label acquisition. However, existing methods are mostly borrowed directly from large language models designed for NLP, rather than being designed with bioinformatics philosophies in mind. Recently, contrastive mutual information maximization methods have achieved state-of-the-art representations on ImageNet. In this perspective piece, we discuss how viewing evolution as natural sequence augmentation, and maximizing information across phylogenetic “noisy channels”, is a biologically and theoretically desirable objective for pretraining encoders. We first review the current contrastive learning literature, then present an illustrative example showing that contrastive learning with evolutionary augmentation can serve as a representation learning objective that maximizes the mutual information between biological sequences and their conserved function, and finally outline the rationale for this approach.

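To make the objective concrete, the sketch below (not from the paper; the encoder, the homolog-sampling step, and all names are hypothetical placeholders) treats a sampled homolog as the "evolutionarily augmented" view of each sequence and trains an encoder with the InfoNCE/NT-Xent loss, whose minimization lower-bounds the mutual information between the two views.

```python
# Minimal PyTorch sketch, assuming PyTorch is available: homologs act as the
# augmented "views" of a positive pair, and InfoNCE pulls their embeddings
# together while pushing apart unrelated sequences in the batch.
import torch
import torch.nn.functional as F

def info_nce_loss(z_query, z_homolog, temperature=0.1):
    """InfoNCE over a batch: row i of z_query is positive with row i of
    z_homolog; every other row in the batch serves as a negative."""
    z_query = F.normalize(z_query, dim=-1)
    z_homolog = F.normalize(z_homolog, dim=-1)
    logits = z_query @ z_homolog.t() / temperature   # (B, B) cosine-similarity logits
    targets = torch.arange(z_query.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

class ToySequenceEncoder(torch.nn.Module):
    """Stand-in encoder: mean-pooled residue embeddings plus a linear projection.
    Any CNN or transformer over tokenized sequences could take its place."""
    def __init__(self, vocab_size=21, dim=64):
        super().__init__()
        self.embed = torch.nn.EmbeddingBag(vocab_size, dim)  # default mode: mean pooling
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, tokens):  # tokens: (B, L) integer-coded residues
        return self.proj(self.embed(tokens))

# Toy usage: `homologs` stands in for sequences drawn from the same family as
# `queries` (e.g. via an MSA or database search); here both are random placeholders.
encoder = ToySequenceEncoder()
queries = torch.randint(0, 21, (8, 100))
homologs = torch.randint(0, 21, (8, 100))
loss = info_nce_loss(encoder(queries), encoder(homologs))
loss.backward()
```

In practice the positive view would come from a curated family database, an alignment, or a simulated-evolution step rather than random tokens; the loss itself is unchanged, which is what lets evolution play the role that hand-crafted augmentations play in vision.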