A Distributed Whole Genome Sequencing Benchmark Study

Population sequencing often requires collaboration across a distributed network of sequencing centers so that thousands of samples can be processed in a timely manner. In such large efforts, participating scientists need to be confident that the accuracy of the sequence data does not depend on which center generated it. We conducted a study across three established sequencing centers, located in Montreal, Toronto, and Vancouver, which together constitute Canada’s Genomics Enterprise (www.cgen.ca). Whole genome sequencing was performed at each center on three genomic DNA replicates from each of three well-characterized cell lines. The secondary analysis pipeline employed by each site was applied to the sequence data from every site, giving three conditions for each of four variables (cell line, replicate, sequencing center, and analysis pipeline), for a total of 81 (3 × 3 × 3 × 3) datasets. Each dataset was assessed against multiple quality metrics, including concordance with benchmark variant truth sets, to determine whether quality was consistent across all three conditions for each variable. Three-way concordance analysis of variants across the conditions for each variable was also performed. Our results showed that the variant concordance between datasets differing only by sequencing center was similar to the concordance between datasets differing only by replicate, when the same analysis pipeline was used. We also showed that the statistically significant differences between datasets result from the analysis pipeline used, which can be unified and updated as new approaches become available. We conclude that genome sequencing projects can rely on the quality and reproducibility of aggregate data generated across a network of distributed sites.
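The study design and the three-way comparison can be illustrated with a minimal Python sketch. The abstract does not define its concordance metric, so the function below assumes one plausible reading: the fraction of variants called in all three datasets out of the variants called in any of them. All labels (cell line, replicate, and pipeline names), the variant tuple layout, and the function names are hypothetical placeholders, not the study's identifiers.

```python
from itertools import product

# Placeholder labels (assumptions); the abstract only names the three cities.
CELL_LINES = ["cell_line_1", "cell_line_2", "cell_line_3"]
REPLICATES = ["replicate_1", "replicate_2", "replicate_3"]
CENTERS = ["Montreal", "Toronto", "Vancouver"]
PIPELINES = ["pipeline_Montreal", "pipeline_Toronto", "pipeline_Vancouver"]


def enumerate_datasets():
    """List every (cell line, replicate, center, pipeline) combination."""
    return list(product(CELL_LINES, REPLICATES, CENTERS, PIPELINES))


def three_way_concordance(calls_a, calls_b, calls_c):
    """Fraction of variants shared by all three call sets, relative to
    their union. This definition is an assumption made for illustration."""
    union = calls_a | calls_b | calls_c
    if not union:
        return 1.0  # three empty call sets agree trivially
    shared = calls_a & calls_b & calls_c
    return len(shared) / len(union)


# Toy variants keyed as (chromosome, position, ref allele, alt allele).
v1 = ("chr1", 1000, "A", "G")
v2 = ("chr1", 2000, "C", "T")
v3 = ("chr2", 3000, "G", "A")
v4 = ("chr2", 4000, "T", "C")

# Same cell line, replicate, and pipeline; only the sequencing center differs.
montreal_calls = {v1, v2, v3}
toronto_calls = {v1, v2, v4}
vancouver_calls = {v1, v2}

datasets = enumerate_datasets()
score = three_way_concordance(montreal_calls, toronto_calls, vancouver_calls)
print(len(datasets))
print(score)
```

Holding three of the four variables fixed and varying the fourth, as in the toy example, yields one three-way comparison. Repeating this for each variable lets the concordance across centers be compared with the concordance across replicates.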
