Advanced Whole Genome Sequencing Using an Entirely PCR-free Massively Parallel Sequencing Workflow

Background Systematic errors can be introduced by DNA amplification during massively parallel sequencing (MPS) library preparation and sequencing array formation. Polymerase chain reaction (PCR)-free genomic library preparation methods were previously shown to improve whole genome sequencing (WGS) quality on the Illumina platform, especially in calling insertions and deletions (InDels). We hypothesized that substantial InDel errors continue to be introduced by the remaining PCR step of DNA cluster generation. In addition to library preparation and sequencing, data analysis methods are also important for the accuracy of the output data. In recent years, several machine learning variant calling pipelines have emerged that can correct systematic errors from MPS and improve variant calling performance.
Results Here, PCR-free libraries were sequenced on the PCR-free DNBSEQ™ arrays from MGI Tech Co., Ltd. (referred to as MGI) to accomplish the first true PCR-free WGS, in which the whole process is PCR-free during both library preparation and sequencing. We demonstrated that PCR-based WGS libraries have significantly (about 5 times) more InDel errors than PCR-free libraries. Furthermore, PCR-free WGS libraries sequenced on the PCR-free DNBSEQ™ platform have up to 55% fewer InDel errors compared to the NovaSeq platform, confirming that DNA clusters contain PCR-generated errors. In addition, low coverage bias and a read duplication rate of less than 1% were reproducibly obtained with the DNBSEQ™ PCR-free workflow using either ultrasonic or enzymatic DNA fragmentation MGI kits combined with MGISEQ-2000. Meanwhile, variant calling performance (single-nucleotide polymorphisms (SNPs) F-score > 99.94%, InDels F-score > 99.6%) exceeded widely accepted standards using machine learning (ML) methods (DeepVariant or DNAscope).
Conclusions Enabled by the new PCR-free library preparation kits, ultra-high-throughput PCR-free sequencers, and ML-based variant calling, true PCR-free DNBSEQ™ WGS provides a powerful solution for improving WGS accuracy while reducing cost and analysis time, thus facilitating future precision medicine, cohort studies, and large population genome projects.
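
The F-scores quoted in the Results can be read as follows, assuming they are the conventional F1 score (the harmonic mean of precision and recall); the abstract does not state which F-score variant was used. The sketch below is illustrative only: the function name `f_score` and the variant counts are hypothetical and are not taken from the study.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score from true-positive, false-positive and
    false-negative variant counts (assumed definition, see lead-in)."""
    if tp == 0:
        # No correct calls: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)  # fraction of called variants that are real
    recall = tp / (tp + fn)     # fraction of real variants that were called
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for illustration, not data from the paper.
    score = f_score(tp=9990, fp=5, fn=5)
    print(f"F-score: {score:.4%}")
```

Because F1 penalizes false positives and false negatives equally, a single score summarizes both spurious and missed calls.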
