PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic, infecting 219 million people as of October 19, 2021, with a 3.6% mortality rate. Natural selection can generate favorable mutations that confer fitness advantages; however, the coronaviruses identified so far may be only the tip of the iceberg, and potentially more fatal variants of concern (VOCs) may emerge over time. Understanding the patterns of emerging VOCs and forecasting mutations that may lead to gain of function or immune escape are urgently required. Here, we developed PhyloTransformer, a Transformer-based discriminative model that uses a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage. To identify complex dependencies between the elements of each input sequence, PhyloTransformer uses advanced modeling techniques, including Fast Attention Via positive Orthogonal Random features (FAVOR+) from Performer and the Masked Language Model (MLM) from Bidirectional Encoder Representations from Transformers (BERT). PhyloTransformer was trained on 1,765,297 genetic sequences retrieved from the Global Initiative for Sharing All Influenza Data (GISAID) database. First, we compared the prediction accuracy for novel mutations and novel combinations against an extensive set of baseline models; PhyloTransformer outperformed every baseline method with statistical significance. Second, we examined predictions of mutations at each nucleotide of the receptor binding motif (RBM) and found our predictions to be precise and accurate. Third, we predicted modifications of N-glycosylation sites to identify mutations associated with altered glycosylation that may be favored during viral evolution. We anticipate that PhyloTransformer may guide proactive vaccine design for effective targeting of future SARS-CoV-2 variants.
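To make the MLM objective mentioned above concrete, the following is a minimal sketch of BERT-style masking applied to a nucleotide sequence. It is not the PhyloTransformer implementation: the per-nucleotide tokenization, the 15% selection rate, the 80/10/10 replacement rule (taken from the original BERT recipe), and the names `mask_sequence`, `NUCLEOTIDES`, and `MASK` are all assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"   # assumed vocabulary: one token per nucleotide
MASK = "[MASK]"        # assumed mask token, named as in BERT


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (tokens, labels) for one nucleotide sequence.

    labels[i] holds the original nucleotide at every selected position and
    None elsewhere, so a model would be scored only on selected positions.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    labels = [None] * len(tokens)
    for i, original in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        labels[i] = original
        roll = rng.random()
        if roll < 0.8:
            tokens[i] = MASK                     # 80%: replace with the mask token
        elif roll < 0.9:
            tokens[i] = rng.choice(NUCLEOTIDES)  # 10%: replace with a random nucleotide
        # remaining 10%: leave the original nucleotide in place
    return tokens, labels


if __name__ == "__main__":
    # Arbitrary toy sequence, not taken from GISAID.
    tokens, labels = mask_sequence("ACGTTGCAAGCTTAGCCGTA", mask_rate=0.15, seed=1)
    print(" ".join(tokens))
    print([(i, base) for i, base in enumerate(labels) if base is not None])
```

A model trained this way has to recover each selected nucleotide from the context on both sides of it, which is the bidirectional dependency modeling that the abstract attributes to the MLM component.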
