How fast does the SARS-CoV-2 virus really mutate in heterogeneous populations?

We introduce the problem of determining the mutational support of genes in the SARS-CoV-2 virus and of estimating the distribution of mutations within different genes from sample sizes too small to allow accurate maximum likelihood estimation. The mutational support refers to the unknown number of sites mutated across all strains and individual samples of the SARS-CoV-2 genome. Given the high cost and limited availability of real-time polymerase chain reaction (RT-PCR) test kits, only a small number of genomic samples (on the order of thousands) are available, especially in the early stages of infections, and such samples do not allow for determining the exact degree of mutation in an RNA virus that comprises roughly 30,000 nucleotides. Nevertheless, working with small sample sets is required in order to quickly predict the mutation rate of this and other viruses and to gain insight into their transformational power. Furthermore, with so few samples available, it is hard to estimate the mutational landscape across different age and gender groups and geographical locations, which may be of great importance in assessing different risk categories and the factors that influence susceptibility to infection. To this end, we use our state-of-the-art polynomial estimator techniques and the Good-Turing estimator to obtain estimates based on only roughly 1,000 samples per category. Our analysis reveals an interesting finding: the mutational support appears to be statistically more significant in patient groups that appear to have lower infection rates and handle exposure with milder symptoms, such as women and people of relatively young age (≤ 55).
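As a rough illustration of the Good-Turing idea mentioned above, the following minimal sketch computes the classical Good-Turing estimate of the missing mass, that is, the total probability of symbols not yet observed, as the fraction of observations that occur exactly once. It is not the polynomial estimator used in this work, and the function name `good_turing_missing_mass` and the toy list of mutated site positions are assumptions made purely for illustration.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Classical Good-Turing estimate of the unseen probability mass.

    Returns N1 / n, where N1 is the number of distinct symbols seen
    exactly once and n is the total number of observations.
    """
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    # N1: how many distinct symbols appear exactly once in the sample
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Hypothetical toy data: each entry stands for a genome position at
# which a mutation was observed in some sample (not real measurements).
observed_sites = [241, 3037, 14408, 23403, 241, 3037, 11083, 28144, 241, 26144]

missing_mass = good_turing_missing_mass(observed_sites)
print(f"Estimated probability that the next mutation hits an unseen site: {missing_mass}")
```

A large missing mass suggests that many mutated sites remain unobserved at the current sample size, which is the regime where support estimation from small samples becomes relevant.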
