Accelerating SARS-CoV-2 low frequency variant calling on ultra deep sequencing datasets

With recent advances in sequencing technology, it has become affordable and practical to sequence genomes to very high depth of coverage, allowing researchers to discover low-frequency variants in the genome. However, because sequencing is error-prone, developing algorithms that can separate true variants from noise remains an active area of research. LoFreq is a state-of-the-art algorithm for low-frequency variant detection, but it has a relatively long runtime compared to other tools. In addition, its interface for running in parallel could be simplified to support both multithreading and distributing jobs to a cluster. In this work, we describe specific contributions to LoFreq that address these issues.
