Modeling SARS-CoV-2 substitution processes: predicting the next variant

We build statistical models to describe the substitution process in SARS-CoV-2 as a function of explanatory factors describing the sequence, its function, and more. These models serve two different purposes: first, to gain knowledge about the evolutionary biology of the virus; and second, to predict future mutations in the virus, in particular non-synonymous amino acid substitutions that create new variants. We use tens of thousands of publicly available SARS-CoV-2 sequences and consider tens of thousands of candidate models. Through a careful validation process, we confirm that our chosen models are indeed able to predict new amino acid substitutions: candidates ranked high by our model are eight times more likely to occur than random amino acid changes. We also show that named variants of interest were highly ranked by our models before their appearance, emphasizing the value of our models for identifying likely variants of interest and potentially utilizing this knowledge in vaccine design and other aspects of the ongoing battle against COVID-19.

The intense community effort of SARS-CoV-2 sequencing has yielded a wealth of information about the mutations that have occurred in the virus since it first appeared in humans.
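The "eight times more likely" figure in the abstract reads as a lift statistic: the occurrence rate among highly ranked candidates divided by the occurrence rate among all candidates. The following is a minimal sketch of that calculation under stated assumptions, not the authors' validation code. The function name `lift_at_top`, the `top_fraction` cutoff, and the toy data are all hypothetical, and the exact cutoff and baseline used in the paper are not given in this text.

```python
from typing import Sequence


def lift_at_top(
    scores: Sequence[float],
    occurred: Sequence[int],
    top_fraction: float = 0.1,
) -> float:
    """Occurrence rate among top-ranked candidates divided by the overall rate.

    scores:   model score per candidate substitution (higher = more likely).
    occurred: 1 if the substitution was later observed, else 0.
    """
    if len(scores) != len(occurred) or len(scores) == 0:
        raise ValueError("scores and occurred must be non-empty and equal length")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")

    n_top = max(1, int(len(scores) * top_fraction))
    # Rank candidate indices from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    top_rate = sum(occurred[i] for i in order[:n_top]) / n_top
    base_rate = sum(occurred) / len(occurred)
    if base_rate == 0:
        raise ValueError("no observed substitutions; lift is undefined")
    return top_rate / base_rate


# Toy data (hypothetical): 10 candidates, 2 of which later occurred.
toy_scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.05, 0.4, 0.15, 0.25, 0.35]
toy_occurred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_lift = lift_at_top(toy_scores, toy_occurred, top_fraction=0.2)
print(f"lift in top 20%: {toy_lift:.1f}")
```

A lift of 1 means the ranking does no better than picking candidates at random; values above 1 mean the top-ranked candidates are enriched for substitutions that actually occurred.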
